Abstract

Our work is motivated by geographical forwarding of sporadic alarm packets to a base station in a wireless sensor 
network (WSN), where the nodes are sleep-wake cycling periodically and asynchronously. When a node (referred 
to as the source) gets a packet to forward, either by detecting an event or from an upstream node, it has to wait 
for its neighbors in a forwarding set (referred to as relays) to wake-up. Each of the relays is associated with a 
random reward (e.g., the progress made towards the sink) that is independent and identically distributed (iid). To 
begin with, the source is uncertain about the number of relays, their wake-up times and the reward values, but knows 
their distributions. At each relay wake-up instant, when a relay reveals its reward value, the source's problem is to
decide whether to forward the packet or to wait for further relays to wake up. In this setting, we seek to minimize the expected waiting
time at the source subject to a lower bound on the average reward. In terms of the operations research literature, 
our work can be considered as a variant of the asset selling problem. We formulate the relay selection problem as a 
partially observable Markov decision process (POMDP), where the unknown state is the number of relays. We begin 
by considering the case where the source knows the number of relays. For the general case, where the source only 
knows a probability mass function (pmf) on the number of relays, it has to maintain a posterior pmf on the number 
of relays and forward the packet iff the pmf is in an optimum stopping set. We show that the optimum stopping set 
is convex and obtain an inner bound to this set. We prove a monotonicity result which yields an outer bound. The 
computational complexity of the above policies motivates us to formulate an alternative simplified model, the optimal 
policy for which is a simple threshold rule. We provide simulation results to compare the performance of the inner 
and outer bound policies against the simple policy, and against the optimal policy when the source knows the exact 
number of relays. Given the simplicity and good performance of the simple policy, we heuristically employ
it for end-to-end packet forwarding at each hop in a multihop WSN of sleep-wake cycling nodes.
I. Introduction

We are interested in the problem of packet forwarding in a class of wireless sensor networks (WSNs) in which
local inferences based on sensor measurements could result in the generation of occasional "alarm" packets that
need to be routed to a base-station, where some sort of action could be taken [1], [2], [3]. Such a situation could
arise, for example, in a WSN for human intrusion detection or fire detection in a large region. Such WSNs often
need to run on batteries or on harvested energy and, hence, must be energy conscious in all their operations. The
nodes of such a WSN would be sleep-wake cycling, waking up periodically to perform their tasks. One approach
to the forwarding problem is to use a distributed algorithm that schedules the sleep-wake cycles of the nodes so that
the delay of a packet from its source to the sink on a multihop path is minimized [2], [4]. Such algorithms require an
organizational phase, which increases the protocol overhead; moreover, the scheduling algorithm has to be rerun
periodically because the clocks at different nodes drift at different rates, so that a previously computed schedule
becomes stale after a long period of operation. For a survey of routing techniques in wireless sensor and ad hoc
networks and their classification, see [5], [6].

In this paper we are concerned with the sleep-wake cycling approach that permits the nodes to wake up
independently of each other even though each node wakes up periodically, i.e., asynchronous periodic sleep-wake
cycling Q, [7]. In fact, given the need for a long network lifetime, nodes are more likely to be sleeping than
awake. In such a situation, when a node has a packet to forward, it has to wait for its neighbors to wake up. When
a neighbor node wakes up, the forwarding node can evaluate it for use as a relay, e.g., in terms of the progress
it makes towards the destination node, the quality of the channel to the relay, the energy level of the relay, etc.
(see [8], [9] for different routing metrics based on these quantities). We think of this as a reward
offered by the potential relay. The end-to-end network objective is to minimize the average total delay subject to a 
lower bound on some measure of total reward along the end-to-end path. In this paper we address this end-to-end 
objective by considering optimal strategies at each hop. When a node gets a packet to forward, it has to make 
decisions based only on the activities in its neighborhood. Waiting for all potential relays to wake up and choosing
the one with the best reward maximizes the reward at each hop, but increases the forwarding delay. On the other
hand, forwarding to the first relay to wake up may result in the loss of the opportunity of choosing a node with
a better reward. Hence, at each hop, there is a trade-off between the one-hop delay and the one-hop reward. By 
solving the one-hop problem of minimizing the average delay subject to a constraint on the average reward, we 
expect to capture the trade-off between the end-to-end metrics. For instance, suppose the end-to-end objective is to
minimize the expected end-to-end delivery delay subject to an upper bound on the expected number of hops in the
path; the motivation for this constraint is that traversing more hops entails a greater expenditure of energy in the
network. In our approach, we would heuristically address this problem by considering at each hop the problem of
minimizing the mean forwarding delay subject to a lower bound on the progress made towards the sink. Greater 
progress at each hop entails greater delay per hop, while reducing the number of hops it takes a packet to reach 
the sink. 
The local problem setting is the following. Somewhere in the network a node has just received a packet to
forward; for the local problem we refer to this forwarding node as the source and take the time at which it gets
the packet as 0. There is an unknown number of relays in the forwarding set of the source. In the geographical
forwarding context, this lack of information on the number of relays could model the fact that the neighborhood
of a forwarding node could vary over time due, for example, to node failures, variation in channel conditions,
or (in a mobile network) the entry or exit of mobile relays. However, we assume that the number of relays is
bounded by a known number K, and the source has an initial probability mass function (pmf) over {1, …, K} on
the number of potential relays. The source desires to forward the packet within the interval [0, T], while knowing
that the relays wake up independently and uniformly over [0, T] and the rewards they offer are independent and
identically distributed (iid). We will formally introduce our model in Section II. Next we discuss related work and
highlight our contributions.

A. Related Work 

Here we provide a summary of related literature in the context of geographical forwarding and channel selection. 
Since our problem also belongs to the class of asset selling problems studied in the operations research literature, we
survey related work from there as well.

Geographical forwarding problems: In our prior work Q we have considered a simple model where the number
of relays is a constant that is known to the source. There the reward is simply the progress made by a relay node
towards the sink. In the current work we have generalized our earlier model by allowing the number of relays to
be unknown to the source. We also allow a general reward structure.

There has been other work in the context of geographical forwarding and anycast routing, where the problem 
of choosing one among several neighboring nodes arises. Zorzi and Rao [10] consider a scenario of geographical
forwarding in a wireless mesh network in which the nodes know their locations, and are sleep-wake cycling. They 
propose GeRaF (Geographical Random Forwarding), a distributed relaying algorithm, whose objective is to carry a 
packet to its destination in as few hops as possible, by making as large progress as possible at each relaying stage. 
For their algorithm, the authors obtain the average number of hops (for given source-sink distance) as a function 
of the node density. These authors do not consider the trade-off between the relay selection delay and the reward 
gained by selecting a relay, which is a major contribution of our work. 

Liu et al. [11] propose a relay selection approach as part of CMAC, a protocol for geographical packet
forwarding. With respect to the fixed sink, a node i has a forwarding set consisting of all nodes that make progress
greater than r_p (an algorithm parameter). If Y represents the delay until the first wake-up instant of a node in the
forwarding set, and X is the corresponding progress made, then, under CMAC, node i chooses an r_p that minimizes
the expected normalized latency E[Y/X]. The Random Asynchronous Wakeup (RAW) protocol [12] also considers
transmitting to the first node to wake up that makes a progress greater than a threshold. Interestingly, this is the
structure of the optimal policy for our simplified model in Q. For the sake of completeness we have described the
simplified model in this paper as well (see Section VI). Thus we have provided analytical support for using such
a threshold policy.
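A minimal Monte Carlo sketch of such a threshold rule follows. It assumes, for illustration only, a known number n of relays with wake-up instants iid uniform on [0, T] and rewards iid uniform on [0, 1); the function name simulate_threshold_rule and the fallback of forwarding to the best relay at time T are our own choices, not the exact CMAC or RAW procedures.

```python
import random


def simulate_threshold_rule(n, T, alpha, runs=100_000, seed=0):
    """Estimate mean one-hop delay and mean reward of a simple threshold rule.

    Illustrative assumptions: n relays wake up iid uniform on [0, T] and offer
    iid Uniform[0, 1) rewards. The source forwards to the first relay whose
    reward exceeds alpha; if none does, it waits until T and forwards to the
    best relay seen (which was kept awake).
    """
    rng = random.Random(seed)
    total_delay = 0.0
    total_reward = 0.0
    for _ in range(runs):
        wakeups = sorted(rng.uniform(0.0, T) for _ in range(n))
        rewards = [rng.random() for _ in range(n)]  # independent of wake-ups
        for w, r in zip(wakeups, rewards):
            if r > alpha:  # first relay to beat the threshold
                total_delay += w
                total_reward += r
                break
        else:
            # No relay crossed the threshold: wait until T, use the best one.
            total_delay += T
            total_reward += max(rewards)
    return total_delay / runs, total_reward / runs
```

Sweeping alpha from 0 towards 1 traces out the delay-reward trade-off: alpha = 0 forwards to the first relay (mean delay T/(n + 1)), while larger thresholds buy a higher mean reward at the cost of a longer mean delay.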

Kim et al. [13] consider a dense WSN. As in the motivation for our model, an occasional alarm packet needs to
be sent, from wherever in the network it is generated, to the sink. The authors develop an optimal anycast scheme
to minimize the average end-to-end delay from any node i to the sink when each node i wakes up asynchronously
with rate r_i. They show that periodic wake-up patterns obtain minimum delay among all sleep-wake patterns with
the same rate. They propose an algorithm called LOCAL-OPT [13], which yields, for each node i, a threshold h_{ij}^*
for each of its neighbors j. If the time at which neighbor j wakes up is less than h_{ij}^*, then i will transmit to j.
Otherwise j will go back to sleep and i will continue waiting for further neighbors. A key drawback is that a
configuration phase is required to run the LOCAL-OPT algorithm.

Rossi et al. [14] consider the problem where a node i, with a packet to forward and which is n hops away from
the sink, has to choose between two of its shortlisted neighbors. The first shortlisted neighbor is the one with the
least cost among all neighbors with hop count n − 1 (one less than that of node i). The second one is the least cost node
among all its neighbors with hop count n (same as that of node i). Though the first node is on the shortest path,
it may not be the best option when its cost is high. It turns out that it is optimal to choose one node
over the other by comparing the cost difference with a threshold. The threshold depends on the cost distribution
of the nodes that are two hops away from node i. Here there is no notion of sleep-wake cycling, so all the
neighbor costs are known when node i gets a packet to forward; the problem is one of one-shot decision making.
In our problem a neighbor's cost becomes available only after it wakes up, at which instant node i has to make a
decision regarding forwarding. Hence, ours is a sequential decision problem.

Channel selection problems: Akin to the relay selection problem is the problem of channel selection. The authors in 
[15], [16] consider a model where there are several channels available to choose from. The transmitter has to probe
the channels to learn their quality. Probing many channels yields one with a good gain but reduces the effective 
time for transmission within the channel coherence period. The problem is to obtain optimal strategies to decide 
when to stop probing and to transmit. Here the number of channels is known and all the channels are available 
at the very beginning of the decision process. In our problem the number of relays is not known, and the relays 
become available at random times. 

Asset selling problems: The basic asset selling problem [17], [18] comprises N offers that arrive sequentially over
discrete time slots. The offers are iid. As the offers arrive, the seller has to decide whether to take an offer or wait
for future offers. The seller has to pay a cost to observe the next offer. Previous offers cannot be recalled. The
decision process ends with the seller choosing an offer. Over the years, several variants of the basic problem have
been studied, both with and without recalling the previous offers. Recently Kang [19] has considered a model where
a cost has to be paid to recall the previous best offer. Further, the previous best offer can be lost at the next time
instant with some probability. See [19] for further references to literature on models with uncertain recall. In [20],
the authors consider a model in which the offers arrive at the points of a renewal process. Additional literature on
such work can be found in [20]. In these models, the number of potential offers is either known or infinite. In
[21], a variant is studied in which the asset selling process can reach a deadline in the next slot with some fixed
probability, provided that the process has proceeded up to the present slot.
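For contrast with our setting, the classical known-N version of the basic asset selling problem (no recall) is solved by backward induction. The sketch below assumes, for illustration, offers iid uniform on [0, 1] and a fixed cost paid before each observation; the helper names uniform_max_mean and asset_selling_values are hypothetical.

```python
def uniform_max_mean(v):
    """E[max(X, v)] for X ~ Uniform[0, 1] and a constant v."""
    if v <= 0.0:
        return 0.5
    if v >= 1.0:
        return v
    return (1.0 + v * v) / 2.0


def asset_selling_values(num_offers, cost):
    """Backward induction for the basic asset selling problem without recall.

    Illustrative assumptions: offers are iid Uniform[0, 1] and `cost` is paid
    before each observation. values[j] is the optimal expected net payoff when
    j offers remain unobserved. An offer followed by r further offers is
    accepted iff it is >= values[r]; values[0] = -inf forces acceptance of the
    last offer.
    """
    values = [float("-inf")]
    for j in range(1, num_offers + 1):
        # Pay to see the next offer, then keep the better of it and continuing.
        values.append(uniform_max_mean(values[j - 1]) - cost)
    return values
```

With zero cost, values[1:4] = [0.5, 0.625, 0.6953125]: an offer with one, two or three offers still to come is accepted iff it is at least 0.5, 0.625 or 0.6953125, respectively. The recursion relies on knowing how many offers remain, which is exactly the information the source lacks in our relay problem.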

In our work the number of offers (i.e., relays) is not known. Also the successive instants at which the offers 
arrive are the order statistics of an unknown number of iid uniform random variables over an interval [0,T]. After 
observing a relay, the probability that there are no more relays to go (which is the probability that the present stage 
is the last one) is not fixed; it has to be updated based on the previous such probabilities and the
inter-wake-up times between successive relays. Although our problem falls in the class of asset selling problems,
to the best of our knowledge the particular setting we have considered in this paper has not been studied before. 

B. Our Contributions 

With the number of relays being unknown, the natural approach is to formulate the problem as a partially 
observed Markov decision process (POMDP). A POMDP is a generalization of an MDP, where at each stage the 
actual internal state of the system is not available to the controller. Instead, the controller can observe a value
from an observation space. The observation depends probabilistically on the current actual state and the previous
action. In some cases, a POMDP can be converted to an equivalent MDP by regarding a belief (i.e., a probability
distribution) on the state space as the state of the equivalent MDP. For a survey of POMDPs see [22]. Note
that, even if the actual state space is finite, the belief space is uncountable. There are several algorithms available to
obtain the optimal policy when the actual state space is finite [23], starting from the seminal work by Smallwood
and Sondik [24]. When the number of states is large, these algorithms are computationally intensive. In general,
it is not easy to obtain an optimal policy for a POMDP. In the current work, we characterize the optimal
policy in terms of an optimum stopping set. We make use of the convexity results in [25] and some properties
specific to our problem to obtain an inner bound on the optimum stopping set. We prove a simple monotonicity
result to obtain an outer bound. In summary, the following are the main contributions of our work:

• We formulate the problem of relay selection with partial information as a finite-horizon partially observable
Markov decision process (POMDP), with the unknown state being the actual number of relays (Section III-B).
The posterior pmf on the number of relays is shown to be a sufficient decision statistic. 

• We first consider the completely observable MDP (COMDP) version of the problem where the source knows 
the number of relays with probability one (wp1) (Section IV). The optimal policy is characterized by a sequence
of threshold functions. 

• For the POMDP, at each stage the optimum stopping set is the set of all pmfs on the number of relays where
it is optimal to stop (Section V). We prove that this set is convex (Section V-A) and provide an inner bound
(subset) for it (Section V-B). We prove a monotonicity result and obtain an outer bound (superset, Section V-C).
The threshold functions obtained in the COMDP version are used in the design of the bounds. These threshold
functions need to be obtained recursively, which is, in general, computationally intensive.
• The complexity of the above policies motivates us to consider a simplified model (Section VI). We prove that
the optimal policy for this simplified model is a simple threshold rule. 

• Through simulations (Section VII-A) we compare the performance of various policies with that of the
optimal COMDP policy. The inner bound policy performs slightly better than the outer bound policy. The
simple policy obtained from the simplified model performs very close to the inner bound. We also show
the poor performance of a naive policy that assumes the actual number of relays to be simply the expected
number.

• Finally, as a heuristic for the end-to-end problem in the geographical forwarding context, we apply the simple
policy at each hop and study the end-to-end performance by simulation (Section VII-B). We find that it is
possible to trade off between the expected end-to-end delay and the expected number of hops by tuning a parameter.

For ease of presentation, in the main sections we only provide an outline of the proof for most of the lemmas,
followed by a brief description. Formal proofs are available in Appendices I, II and III. Appendix IV contains
additional simulation results.

II. System Model 

We consider the one-stage problem in which a node in the network receives a packet to forward. We call this node
the "source" and the nodes that it could potentially forward the packet to are called "relays". The local problem
is taken to start at time 0. Thus at time 0, the source node has a packet to forward to a sink but needs a relay
node to accomplish this task. There is a nonempty set of N relay nodes, labeled by the indices 1, 2, …, N. N is
a random variable bounded above by K, a system parameter that is known to the source node, i.e., the support
of N is {1, 2, …, K}. The source does not know N, but knows the bound K and a pmf p_0 on {1, 2, …, K},
which is the initial pmf of N. A relay node i, 1 ≤ i ≤ N, becomes available to the source at the instant T_i.
The source knows that the instants {T_i} are iid uniformly distributed on (0, T). Observe that this would be the
case if the wake-up instants of all the nodes in the network are periodic with period T, if these (periodic) renewal
processes are stationary and independent, and if the forwarding node's decision instants are stopping times w.r.t.
these wake-up time processes [26].

We call T_i the wake-up instant of relay i. If the source forwards the packet to relay i, then a reward of R_i is
accrued. The rewards R_i, i = 1, 2, …, N, are iid random variables with pdf f_R. The support of f_R is [0, R̄]. The
source knows this statistical characterization of the rewards, and also that the {R_i} are independent of the wake-up
instants {T_i}. When a relay wakes up at T_i and reveals its reward R_i, the source has to decide whether to transmit
to relay i or to wait for further relays. If the source decides to wait, then it instructs the relay with the best reward
to stay awake, while letting the rest go back to sleep. This way the source can always forward to a relay with the
best reward among those that have woken up so far.
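The model above is straightforward to simulate, which is one way policies can be compared numerically. The following sketch draws one realization: N from p_0, the ordered wake-up instants, the iid rewards, and the running best reward B_k that the source can always fall back on. The function name sample_relays and the reward_sampler argument (for example, a uniform draw on [0, R̄]) are illustrative assumptions. Assigning independently drawn rewards to the sorted wake-up instants is valid because the rewards are iid and independent of the wake-up instants.

```python
import random


def sample_relays(p0, T, reward_sampler, rng):
    """Draw one realization of the one-hop model (hypothetical helper).

    p0[n - 1] is P(N = n) for n = 1, ..., K. Returns N, the ordered wake-up
    instants W_1 <= ... <= W_N, the rewards R_1, ..., R_N in wake-up order,
    and the running best rewards B_k = max{R_1, ..., R_k}.
    """
    K = len(p0)
    n = rng.choices(range(1, K + 1), weights=p0)[0]
    wakeups = sorted(rng.uniform(0.0, T) for _ in range(n))
    rewards = [reward_sampler(rng) for _ in range(n)]
    best, running = [], float("-inf")
    for r in rewards:
        running = max(running, r)  # the best relay so far is kept awake
        best.append(running)
    return n, wakeups, rewards, best
```

For example, `sample_relays([0.2, 0.3, 0.5], 1.0, lambda g: g.uniform(0.0, 1.0), random.Random(1))` draws a scenario with K = 3, T = 1 and rewards uniform on [0, 1].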

Given that N = n (throughout this discussion we will focus on the event {N = n}), let W_1, W_2, …, W_n represent
the order statistics of T_1, T_2, …, T_n, i.e., the {W_k} sequence is the {T_i} sequence sorted in increasing order.
The pdf of the k-th (k ≤ n) order statistic [27, Chapter 2] is, for 0 ≤ u ≤ T,

$$f_{W_k|N}(u|n) = \frac{n!}{(k-1)!\,(n-k)!}\,\frac{u^{k-1}(T-u)^{n-k}}{T^{n}}. \tag{1}$$

Also the joint pdf of the k-th and the ℓ-th order statistics (for k < ℓ ≤ n) is, for 0 ≤ u < v ≤ T,

$$f_{W_k,W_\ell|N}(u,v|n) = \frac{n!}{(k-1)!\,(\ell-k-1)!\,(n-\ell)!}\,\frac{u^{k-1}(v-u)^{\ell-k-1}(T-v)^{n-\ell}}{T^{n}}. \tag{2}$$
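As a quick sanity check of (1), the sketch below integrates u f_{W_k|N}(u|n) numerically and compares it with a simulated mean of the k-th smallest of n uniform draws. Both should be close to kT/(n + 1), the standard mean of a uniform order statistic; the function names and the number of integration steps are our own illustrative choices.

```python
import math
import random


def order_stat_pdf(u, k, n, T):
    """Equation (1): pdf of W_k, the k-th smallest of n iid Uniform(0, T)."""
    if not 0.0 <= u <= T:
        return 0.0
    coeff = math.factorial(n) / (math.factorial(k - 1) * math.factorial(n - k))
    return coeff * u ** (k - 1) * (T - u) ** (n - k) / T ** n


def pdf_mean(k, n, T, steps=20_000):
    """Midpoint-rule approximation of E[W_k | N = n] = integral of u f(u) du."""
    h = T / steps
    return sum(
        (i + 0.5) * h * order_stat_pdf((i + 0.5) * h, k, n, T) * h
        for i in range(steps)
    )


def simulated_mean(k, n, T, runs=50_000, seed=0):
    """Monte Carlo estimate of E[W_k | N = n] by sorting n uniform draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        total += sorted(rng.uniform(0.0, T) for _ in range(n))[k - 1]
    return total / runs
```

For k = 2, n = 5 and T = 3, both functions should return values close to 2 · 3/6 = 1.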

Using the above expressions, we can write down the conditional pdf f_{W_{k+ℓ}|W_k,N} (for 1 ≤ ℓ ≤ n − k) as, for
0 ≤ w < T and 0 ≤ u ≤ T − w,

$$f_{W_{k+\ell}|W_k,N}(w+u|w,n) = \frac{f_{W_k,W_{k+\ell}|N}(w,w+u|n)}{f_{W_k|N}(w|n)} = \frac{(n-k)!}{(\ell-1)!\,((n-k)-\ell)!}\,\frac{u^{\ell-1}\big((T-w)-u\big)^{(n-k)-\ell}}{(T-w)^{n-k}}. \tag{3}$$

Comparing (3) with (1), we observe, as expected, that given N = n, the pdf of the wake-up instant of the (k + ℓ)-th
node, conditioned on the wake-up instant of the k-th node, is that of the ℓ-th order statistic of (n − k) iid random variables
that are uniform on the remaining time (T − w). Let W_0 = 0 and define U_k = W_k − W_{k−1} for k = 1, 2, …, n.
The U_k are the inter-wake-up times between consecutive nodes (see Fig. 1). Later we will be interested in
the conditional pdf f_{U_{k+1}|W_k,N} for k = 0, 1, …, n − 1, which is given by, for 0 ≤ w < T and 0 ≤ u ≤ T − w,

$$f_{U_{k+1}|W_k,N}(u|w,n) = f_{W_{k+1}|W_k,N}(w+u|w,n) = \frac{(n-k)\,(T-w-u)^{n-k-1}}{(T-w)^{n-k}}. \tag{4}$$

The conditional expectation is given by

$$E[U_{k+1}|W_k = w, N = n] = \frac{T-w}{n-k+1}, \tag{5}$$

which is simply the expected value of the minimum of n − k random variables (n − k is the remaining number of
relays), each of which is iid uniform on the interval [0, T − w) (T − w is the remaining time).
Definition 1: For notational simplicity we define

$$f_k(u|w,n) := f_{U_{k+1}|W_k,N}(u|w,n), \qquad E_k[\,\cdot\,|w,n] := E[\,\cdot\,|W_k = w, N = n].$$

Note that f_k(·|w,n) depends on n and k only through the difference n − k, and on w only through T − w. ■
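The density f_k in (4) and the mean in (5) can be checked numerically. The sketch below (hypothetical helper names) implements f_k(u|w,n), integrates u f_k(u|w,n) with a midpoint rule, and compares the result with (T − w)/(n − k + 1) and with a simulated minimum of n − k uniform draws on [0, T − w].

```python
import random


def f_k(u, w, n, k, T):
    """Equation (4): pdf of U_{k+1} given W_k = w and N = n, for 0 <= k < n."""
    m, rem = n - k, T - w  # relays still to come, remaining time
    if not 0.0 <= u <= rem:
        return 0.0
    return m * (rem - u) ** (m - 1) / rem ** m


def mean_from_pdf(w, n, k, T, steps=20_000):
    """Midpoint-rule approximation of E_k[U_{k+1} | w, n] using (4)."""
    h = (T - w) / steps
    return sum(
        (i + 0.5) * h * f_k((i + 0.5) * h, w, n, k, T) * h for i in range(steps)
    )


def mean_closed_form(w, n, k, T):
    """Equation (5): (T - w) / (n - k + 1)."""
    return (T - w) / (n - k + 1)


def mean_by_simulation(w, n, k, T, runs=50_000, seed=0):
    """Average of the minimum of n - k iid Uniform(0, T - w) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        total += min(rng.uniform(0.0, T - w) for _ in range(n - k))
    return total / runs
```

With w = 0.4, n = 6, k = 2 and T = 1, all three estimates should be close to 0.6/5 = 0.12. Evaluating f_k at (w, n, k) = (0.5, 6, 2) with T = 1 and at (0.25, 5, 1) with T = 0.75 gives the same value, consistent with the remark in Definition 1.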
Since the reward sequence R_1, R_2, …, R_n is iid and independent of the wake-up instants T_1, T_2, …, T_n, we write
(W_k, R_k) for the pairs of ordered wake-up instants and the corresponding rewards. Evidently, f_{R_{k+1}|W_k,N}(r|w,n) =
f_R(r) for k = 0, 1, …, n − 1. Further, we define (when N = n) W_{n+1} := T, U_{n+1} := T − W_n and R_{n+1} := 0.
Also E_n[U_{n+1}|w,n] := T − w. All these variables are depicted in Fig. 1. We end this section by listing, in
Table I, most of the symbols that appear in the paper with a brief description for each.






[Figure: the points (W_k, R_k) plotted in [0, T] × [0, R̄], with the inter-wake-up times U_1, U_2, … and U_{n+1} = T − W_n marked along the time axis.]

Fig. 1. There are N = n relays. (W_k, R_k) represents the wake-up instant and reward, respectively, of the k-th relay. These are shown as points
in [0, T] × [0, R̄]. U_k are the inter-wake-up times. Note that W_{n+1} = T, R_{n+1} = 0 and U_{n+1} = T − W_n.



III. The Sequential Decision Problem 

For the model set up in Section II, we now consider the following sequential decision problem. At each instant
that a relay wakes up, i.e., W_1, W_2, …, the source has to decide whether to forward the packet or to hold the
packet until the next wake-up instant. Since the number of available relays, N, is unknown, we have a decision
problem with partial information. We will show how the problem can be set up in the framework of a partially
observable Markov decision process (POMDP) [22], [28, Chapter 5].

A. Actions, State Space, and State Transition 

Actions: We assume that the time instants at which the relays wake up, i.e., W_1, W_2, …, constitute the decision
instants or stages¹. At each decision instant, there are two actions possible at the source, denoted 0 and 1, where

• 0 represents the action to continue waiting for more relays to wake up, and

• 1 represents the action to stop and forward the packet to the relay that provides the best reward among those
that have woken up up to the current decision epoch.

Since there can be at most K relays, the total number of decision instants is K. The decision process technically
ends at the first instant W_k at which the source chooses action 1, in which case we assume that all the subsequent
decision instants, k + 1, …, K, occur at W_k. In cases where the source ends up waiting until time T (referring to
Fig. 1, this is possible if, even at W_n, the source decides to continue, not realizing that it has seen all the relays
there are in its forwarding set), all the subsequent decision instants are assumed to occur at T.



¹A better choice for the decision instants may be to allow the source to make a decision at any time t ∈ (0, T]. When N is known to the
source, it can be argued that it is optimal to make decisions only at relay wake-up instants. However, this may not hold in our case, where N
is unknown. In this paper we proceed with our restriction on the decision instants and consider the general case as a topic for future work.
TABLE I
List of mathematical notation.

Symbol                  Description
⟨a, b⟩                  Inner product of vectors a and b
a_ℓ(w, b), b_ℓ(w, b)    Thresholds lying on the line joining p_k^(k) and p_k^(k+ℓ) of the simplex P_k; used in the construction of the inner and outer bounds, respectively
B_k                     Best reward so far, i.e., B_k = max{R_1, …, R_k}
C_k(p, w, b)            Average cost of continuing at stage k when the state is (p, w, b)
C_k(w, b)               Optimum stopping set at stage k when (W_k, B_k) = (w, b)
                        Inner bound for the stopping set C_k(w, b)
C_k(w, b)               Outer bound for the stopping set C_k(w, b)
C_1step                 One-step-stopping set for the simplified model
E_k[·|w, n]             Expectation conditioned on (W_k, N) = (w, n)
f_k(·|w, n)             pdf of U_{k+1} conditioned on (W_k, N) = (w, n)
f_R(·)                  pdf of the iid rewards {R_k}
J_k(p, w, b)            Optimal cost-to-go function at stage k when the state is (p, w, b)
K                       Bound on the number of relays
N                       Number of relays; random variable taking values in {1, 2, …, K}
N                       Number of relays in the simplified model; a constant
P(A)                    Probability of an event A
P_k                     Set of all pmfs on the set {k, k + 1, …, K}
(p, w, b)               A typical state at stage k, where p ∈ P_k is the belief state and (W_k, B_k) = (w, b)
p_k^(n)                 A corner point in P_k, i.e., p_k^(n)(n) = 1
R_k                     Reward of the k-th relay
U_{k+1}                 Inter-wake-up time between the k-th and (k + 1)-th relays, i.e., U_{k+1} = W_{k+1} − W_k
W_k                     Wake-up instant of the k-th relay
W_k, R_k, U_{k+1}       Quantities, analogous to the ones in the exact model, for the simplified model
α                       Threshold obtained from the simplified model
γ                       Reward constraint for the problem in (11)
δ_{n−k}(w, b)           When p ∈ P_k is such that p(k) + p(n) = 1, it is optimal to stop iff p(n) ≤ δ_{n−k}(w, b)
η                       Lagrange multiplier, see (12)
−ηb                     Average cost of stopping at stage k when B_k = b
τ_{k+1}(p, w, u)        Belief transition function; τ_{k+1}(p, w, u) is a pmf in P_{k+1} for a given p ∈ P_k, W_k = w and U_{k+1} = u
φ_{n−k}(w, b)           Threshold obtained from the COMDP version of the problem; if the source knows wp1 that N = n, then at a stage k < n with (W_k, B_k) = (w, b) it is optimal to stop iff b ≥ φ_{n−k}(w, b)



State Space: At stage 0 the state space is simply S_0^a = {(n, 0, 0) : 1 ≤ n ≤ K}, and the only action possible is 0;
the superscript a signifies that S_0^a is the set of actual internal states of the system. The state space at
stage 1 is

$$\mathcal{S}_1^a = \big\{(n, w, b) : 1 \le n \le K,\ w \in (0, T),\ b \in [0, \bar{R}]\big\},$$

and for stages k = 2, 3, …, K,

$$\mathcal{S}_k^a = \underbrace{\big\{(n, w, b) : k \le n \le K,\ w \in (0, T),\ b \in [0, \bar{R}]\big\}}_{\mathcal{S}_k^a(1)} \cup \underbrace{\big\{(k-1, T, b) : b \in [0, \bar{R}]\big\}}_{\mathcal{S}_k^a(2)} \cup \underbrace{\{\psi\}}_{\mathcal{S}_k^a(3)}. \tag{6}$$



Thus the state space at stage k = 2, 3, …, K is written as the union of three sets. The physical meanings of these
sets are as follows:

• S_k^a(1): n in the state triple (n, w, b) represents the actual number of relays. The states in this set correspond
to the case where there are at least k relays, i.e., n satisfies k ≤ n ≤ K. In the pair (w, b), w
is the wake-up instant (W_k) of the k-th relay, and b is the best reward (B_k = max{R_1, …, R_k}) among the
relays seen so far. The same remark holds for the states in S_1^a. Stage 0 begins at time 0 with reward 0. Hence the
states in S_0^a are of the form (n, 0, 0).

• S_k^a(2): Suppose there were k − 1 relays and, at stage k − 1, the source decides to continue. Note that it is
possible for the source to make such a decision, since it does not know the number of relays. In such a case, the
source ends up waiting until time T and enters stage k. Hence the states in this set are of the form (k − 1, T, b),
where b represents the best reward among all the k − 1 relays (B_{k−1}).

• S_k^a(3): ψ is the terminating state. The state at stage k will be ψ if the source has already forwarded the packet
at an earlier stage.

State Transition: If the state at stage k is ψ (i.e., the source has already forwarded the packet), then the next state
is always ψ. Suppose (n, w, b) ∈ S_k^a is the state at some stage k, 0 ≤ k ≤ K − 1, and a_k ∈ {0, 1} represents the
action taken. If a_k = 1, then the decision process stops and we regard the system as entering the termination state
ψ, so that the state at all the subsequent stages, k + 1, …, K, is ψ. The source will also terminate the decision
process, knowing that the relays wake up within the interval (0, T), if it has waited for a duration of T. This means
that (n, w, b) ∈ S_k^a(2), i.e., n = k − 1 and w = T.

On the other hand, if (n, w, b) ∈ S_k^a(1) and a_k = 0, the source waits for a random duration U_{k+1} and
encounters a relay with a random reward R_{k+1}, so that the next state is (n, w + U_{k+1}, max{b, R_{k+1}}). Note that
if n = k, i.e., the current relay is the last one, then, since we have defined U_{k+1} = T − w and R_{k+1} = 0, the next
state will be of the form (k, T, b). Thus the state at stage k + 1 can be written down as

$$s_{k+1} = \begin{cases} \psi & \text{if } w = T \text{ and/or } a_k = 1, \\ \big(n,\ w + U_{k+1},\ \max\{b, R_{k+1}\}\big) & \text{otherwise.} \end{cases} \tag{7}$$

B. Belief State and Belief State Transition

Since the source does not know the actual number of relays N, the state is only partially observable. The source
makes decisions based on the entire history of the wake-up instants and the best rewards. If the source has not
forwarded the packet up to stage k − 1, then define I_k = (p_0, (w_1, b_1), …, (w_k, b_k)) to be the information vector
available at the source when the k-th relay wakes up; w_1, …, w_k represent the wake-up instants of the relays waking
up at stages 1, …, k, and b_1, …, b_k are the corresponding best rewards. Define p_k to be the belief state about N
at stage k given the information vector I_k, i.e., p_k(n) = P(N = n | I_k) for n = k, k + 1, …, K (note that p_k(k) is
the probability that the k-th relay is the last one). Thus, p_k is a pmf in the (K − k)-dimensional probability simplex.
Let us denote this simplex as P_k.

Definition 2: For k = 1, 2, …, let P_k := set of all pmfs on the set {k, k + 1, …, K}. P_k is the (K − k)-dimensional
probability simplex in ℝ^{K−k+1}. ■

The "observation" (w_k, b_k) at stage k is a part of the actual state (n, w_k, b_k). For a general POMDP the
observation can belong to a completely different space than the actual state space. Moreover, the distribution of the
observation at any stage can in general depend on all the previous states, observations, actions and disturbances.
If this distribution depends only on the state, action and disturbance of the immediately preceding stage,
then a belief on the actual state given the entire history turns out to be sufficient for making decisions [28, Chapter
5]. In our case this condition is met, and hence at stage k, (p_k, w_k, b_k) is a sufficient statistic for making decisions.
Therefore we modify the state space as S_0 = {(p, 0, 0) : p ∈ P_1} and, for k = 1, 2, …, K,

$$\mathcal{S}_k = \big\{(p, w, b) : p \in \mathcal{P}_k,\ w \in (0, T],\ b \in [0, \bar{R}]\big\} \cup \{\psi\}. \tag{8}$$

Suppose that after seeing $k$ relays the source chooses not to forward the packet; then, upon the next relay waking up (if any), the source needs to update its belief about the number of relays. Formally, if $(p, w, b) \in \mathcal{S}_k$ is the state at stage $k$ and $w + u$ is the wake-up instant of the next relay, then, using Bayes rule, the next belief state can be obtained via the following belief state transition function, which yields a pmf in $\mathcal{P}_{k+1}$:

$$ \tau_{k+1}(p, w, u)(n) = \frac{p(n)\, f_k(u \mid w, n)}{\sum_{\ell=k+1}^{K} p(\ell)\, f_k(u \mid w, \ell)} \qquad (9) $$

for $n = k+1, \dots, K$. Note that this function does not depend on $b$. Thus, if at stage $k \in \{0, 1, \dots, K-1\}$ the state is $(p, w, b) \in \mathcal{S}_k$, then the next state is

$$ S_{k+1} = \begin{cases} \psi & \text{if } w = T \text{ and/or } a_k = 1, \\ \big(\tau_{k+1}(p, w, U_{k+1}),\ w + U_{k+1},\ \max\{b, R_{k+1}\}\big) & \text{otherwise,} \end{cases} \qquad (10) $$



where $U_{k+1}$ is the random delay until the next relay wakes up and $R_{k+1}$ is the random reward offered by that relay. The explanation of the above belief state transition is the same as that of the actual state transition in (7), except that if the action is to continue, the source needs to update its belief about the number of relays. Suppose that at stage $k$ the actual number of relays happens to be $k$ and the action is to continue, which is possible since the source does not know the actual number; then the source will end up waiting until time $T$ and will then transmit to the relay with the best reward.
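The update (9) is a one-line Bayes rule once a form for $f_k(u \mid w, n)$ is fixed. Definition 1, which specifies it, is not reproduced here, so the sketch below assumes the iid-uniform wake-up model (consistent with the cdf $1 - (1 - u/T)^n$ quoted in Section VI): given $W_k = w$ and $N = n > k$, the $n - k$ sleeping relays wake uniformly on $(w, T)$, so $U_{k+1}$ has density $\frac{n-k}{T-w}\big(1 - \frac{u}{T-w}\big)^{n-k-1}$ on $(0, T-w)$. The function names are hypothetical.

```python
import numpy as np

def next_gap_density(u, w, k, n, T):
    """Assumed form of f_k(u | w, n): density of the gap U_{k+1} after the k-th
    wake-up at instant w, given N = n relays with iid uniform wake-up instants
    on (0, T). The n - k relays still asleep are then uniform on (w, T)."""
    m = n - k                                   # relays that have not yet woken up
    if m <= 0 or not (0.0 < u < T - w):
        return 0.0
    return m / (T - w) * (1.0 - u / (T - w)) ** (m - 1)

def belief_update(p, k, K, w, u, T):
    """Bayes rule (9). p[i] is the belief p(k + i) on {k, ..., K};
    returns the posterior on {k + 1, ..., K} after observing the gap u."""
    p = np.asarray(p, dtype=float)
    likelihood = np.array([next_gap_density(u, w, k, n, T) for n in range(k + 1, K + 1)])
    unnormalized = p[1:] * likelihood           # N = k is ruled out once a relay wakes before T
    return unnormalized / unnormalized.sum()
```

For instance, with $K = 3$, $k = 1$, $w = 0.2$, $u = 0.1$, $T = 1$ and prior $(0.2, 0.3, 0.5)$ on $\{1, 2, 3\}$, the posterior on $\{2, 3\}$ is $(0.375, 1.09375)/1.46875 \approx (0.255, 0.745)$: an early wake-up shifts belief toward more relays. A point mass on some $n$ stays a point mass, which is the property used in the proof of Lemma 3.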

C. Stopping Rules and the Optimization Problem

As the relays wake up, the source's problem is to decide whether to stop or to continue waiting for further relays. A stopping rule, or a policy, $\pi$ is a sequence of mappings $(\mu_1, \dots, \mu_K)$, where $\mu_k : \mathcal{S}_k \to \{0, 1\}$. Let $\Pi$ represent the set of all policies. The delay $D_\pi$ incurred using policy $\pi$ is the instant at which the source forwards the packet; it is either one of the $W_k$ or the instant $T$. The reward $R_\pi$ is the reward associated with the relay to which the packet is forwarded. The problem we are interested in is the following:



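$$ \min_{\pi \in \Pi} \ \mathbb{E}[D_\pi] \quad \text{subject to} \quad \mathbb{E}[R_\pi] \ge \gamma. \qquad (11) $$

To solve the above problem, we consider the following unconstrained problem,

$$ \min_{\pi \in \Pi} \big( \mathbb{E}[D_\pi] - \eta\, \mathbb{E}[R_\pi] \big), \qquad (12) $$

where $\eta > 0$.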

Lemma 1: Let $\pi^*$ be an optimal policy for the unconstrained problem in (12). Suppose that $\eta$ ($=: \eta_\gamma$) is such that $\mathbb{E}[R_{\pi^*}] = \gamma$; then $\pi^*$ is optimal for the main problem in (11) as well.

Proof: For any policy $\pi$ satisfying the constraint $\mathbb{E}[R_\pi] \ge \gamma$ we can write

$$ \begin{aligned} \mathbb{E}[D_{\pi^*}] &\le \mathbb{E}[D_\pi] - \eta_\gamma\big(\mathbb{E}[R_\pi] - \mathbb{E}[R_{\pi^*}]\big) \\ &= \mathbb{E}[D_\pi] - \eta_\gamma\big(\mathbb{E}[R_\pi] - \gamma\big) \\ &\le \mathbb{E}[D_\pi], \end{aligned} $$

where the first inequality is by the optimality of $\pi^*$ for (12), the equality is by the hypothesis on $\eta_\gamma$, and the last inequality is due to the restriction of $\pi$ to $\mathbb{E}[R_\pi] \ge \gamma$. ■

Hence we focus on solving the unconstrained problem in (12).
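Lemma 1 can be checked on a toy example. The sketch below uses three hypothetical policies, each summarized only by a made-up $(\mathbb{E}[D], \mathbb{E}[R])$ pair: minimizing $\mathbb{E}[D] - \eta\,\mathbb{E}[R]$ for a fixed $\eta$, and then setting $\gamma$ to the reward that this minimizer achieves, yields a policy that is also optimal for the constrained problem (11). The function names are hypothetical.

```python
def lagrangian_argmin(policies, eta):
    """Index of the policy minimizing E[D] - eta * E[R], as in problem (12)."""
    return min(range(len(policies)), key=lambda i: policies[i][0] - eta * policies[i][1])

def constrained_argmin(policies, gamma):
    """Index of the policy minimizing E[D] subject to E[R] >= gamma, as in problem (11)."""
    feasible = [i for i, (_, reward) in enumerate(policies) if reward >= gamma]
    return min(feasible, key=lambda i: policies[i][0])

# Hypothetical (E[D], E[R]) pairs for three candidate policies.
policies = [(0.1, 0.3), (0.3, 0.6), (0.6, 0.8)]
best = lagrangian_argmin(policies, eta=1.0)
gamma = policies[best][1]          # choose gamma = E[R] achieved by the Lagrangian optimum
assert constrained_argmin(policies, gamma) == best
```

Sweeping $\eta$ upward moves the Lagrangian optimum toward policies with larger reward and larger delay, which is how a target $\gamma$ is matched in practice.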

D. One-Step Costs

The objective in (12) can be seen as accumulating additively over the steps. If the decision at a stage is to continue, then the delay until the next relay wakes up (or until $T$) gets added to the cost. On the other hand, if the decision is to stop, then the source collects the reward offered by the relay to which it forwards the packet, and the decision process enters the state $\psi$. The cost in state $\psi$ is 0. Suppose $(p, w, b)$ is the state at stage $k$. Then the one-step-cost function is, for $k = 0, 1, \dots, K-1$,

$$ g_k\big((p, w, b), a_k\big) = \begin{cases} -\eta b & \text{if } w = T \text{ and/or } a_k = 1, \\ U_{k+1} & \text{otherwise.} \end{cases} \qquad (13) $$

The cost of termination is $g_K(p, w, b) = -\eta b$. Also note that for $k = 0$ the possible states are of the form $(p, 0, 0)$ and the only possible action is $a_0 = 0$ (there is no relay to forward to yet), so that $g_0\big((p, 0, 0), a_0\big) = U_1$.

E. Optimal Cost-to-go Functions

For $k = 1, 2, \dots, K$, let $J_k(\cdot)$ represent the optimal cost-to-go function at stage $k$. For any state $s_k \in \mathcal{S}_k$, $J_k(s_k)$ can be written as

$$ J_k(s_k) = \min\{\text{stopping cost},\ \text{continuing cost}\}, \qquad (14) $$

where the stopping cost (continuing cost) represents the average cost incurred if the source, at the current stage, decides to stop (continue) and takes optimal actions at the subsequent stages. For the termination state, since the one step cost is zero and the system remains in $\psi$ in all the subsequent stages, we have $J_k(\psi) = 0$. For a state $(p, w, b) \in \mathcal{S}_k$, we next evaluate the two costs in the above expression.

First let us obtain the stopping cost. Suppose that there were $K$ relay nodes and the source has seen them all. In such a case, if $(p, w, b) \in \mathcal{S}_K$ (note that $p$ will just be a point mass on $K$) is the state at stage $K$, then the optimal cost is simply the cost of termination, i.e., $J_K(p, w, b) = g_K(p, w, b) = -\eta b$. For $k = 1, 2, \dots, K-1$, if the action is to stop, then the one step cost is $-\eta b$ and the next state is $\psi$, so that the further cost is $J_{k+1}(\psi) = 0$. Therefore, the stopping cost at any stage is simply $-\eta b$.

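On the other hand, the cost of continuing when the state at stage $k$ is $(p, w, b)$ can be written, using the law of total expectation, as

$$ c_k(p, w, b) = p(k)\big(T - w - \eta b\big) + \sum_{n=k+1}^{K} p(n)\, \mathbb{E}\Big[ U_{k+1} + J_{k+1}\big(\tau_{k+1}(p, w, U_{k+1}),\ w + U_{k+1},\ \max\{b, R_{k+1}\}\big) \ \Big|\ w, n \Big]. \qquad (15) $$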



Each expectation term in the summation in (15) is the average cost to continue conditioned on the event $\{N = n\}$. $U_{k+1}$ is the (random) time until the next relay wakes up ($U_{k+1}$ is the one step cost) and $J_{k+1}(\cdot)$ is the optimal cost-to-go from the next stage onwards ($J_{k+1}(\cdot)$ constitutes the future cost). The next state is obtained via the state transition equation (10). The term $(T - w - \eta b)$ in (15) associated with $p(k)$ is the cost of continuing when the number of relays happens to be $k$, i.e., $N = k$, so that there are no more relays to go. Recall that we had defined $U_{k+1} = T - w$ and $R_{k+1} = 0$ when the actual number of relays is $N = k$. Therefore $T - w$ is the one step cost when $N = k$. Also, $w + U_{k+1} = T$ and $\max\{b, R_{k+1}\} = b$, so that at the next stage (which occurs at $T$) the process terminates (enters $\psi$) with a cost of $-\eta b$ (see (10) and (13)), which represents the future cost.

Thus the optimal cost-to-go function (14) at stage $k = 1, 2, \dots, K-1$ can be written as

$$ J_k(p, w, b) = \min\big\{-\eta b,\ c_k(p, w, b)\big\}. \qquad (16) $$

From the above expression it is clear that at stage $k$, when the state is $(p, w, b)$, the source has to compare the stopping cost, $-\eta b$, with the cost of continuing, $c_k(p, w, b)$, and stop iff $-\eta b \le c_k(p, w, b)$. Later, in Section V, we will use this condition ($-\eta b \le c_k(p, w, b)$) to define the optimum stopping set. We will prove that the continuing cost, $c_k(p, w, b)$, is concave in $p$, leading to the result that the optimum stopping set is convex. Equations (15) and (16) are used extensively in the subsequent development.

IV. Relationship with the Case Where N is Known (the COMDP Version)

In the previous section (Section III) we detailed our problem formulation as a POMDP. The state is partially observable because the source does not know the exact number of relays. It is instructive to first consider the simpler case where this number is known, which is the subject of our earlier work [1]. Hence, in this section, we consider the case where the initial pmf, $p_0$, has all its mass on some $n$, i.e., $p_0(n) = 1$. We call this the COMDP version of the problem.

First we define a sequence of threshold functions which will be useful in the subsequent proofs. These are the same threshold functions that characterize the optimal policy for our model in [1].



Definition 3: For $(w, b) \in (0, T) \times [0, \overline{R}]$, define $\{\phi_\ell : \ell = 0, 1, \dots, K-1\}$ inductively as follows: $\phi_0(w, b) = b$ for all $(w, b)$, and for $\ell = 1, 2, \dots, K-1$ (recall Definition 1),

$$ \phi_\ell(w, b) = \mathbb{E}_{K-\ell}\Big[ \max\big\{ b,\ R,\ \phi_{\ell-1}\big(w + U, \max\{b, R\}\big) \big\} - \frac{U}{\eta} \ \Big|\ w, K \Big]. \qquad (17) $$

In the above expression we have suppressed the subscript $K - \ell + 1$ of $R$ and $U$ for simplicity. The pdf used to take the expectation in the above expression is $f_R(\cdot)\, f_{K-\ell}(\cdot \mid w, K)$ (again recall Definition 1). ■
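Because (17) nests an expectation inside each step of the recursion, the $\phi_\ell$ are usually tabulated numerically. The sketch below does this on a $(w, b)$ grid with midpoint quadrature and linear interpolation of $\phi_{\ell-1}$. It rests on two assumptions that are ours, not the text's: rewards uniform on $(0, 1)$ (so $\overline{R} = 1$), and $f_{K-\ell}(u \mid w, K)$ taken from the iid-uniform wake-up model with $\ell$ relays still asleep, i.e., density $\frac{\ell}{T-w}\big(1 - \frac{u}{T-w}\big)^{\ell-1}$ on $(0, T-w)$. The function name and grid parameters are hypothetical.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def threshold_functions(L, T=1.0, eta=1.0, nw=40, nb=41, nq=40):
    """Tabulate phi_0, ..., phi_L of (17) on a (w, b) grid.

    Assumptions (illustrative only): R ~ Uniform(0, 1); with l relays still
    asleep after instant w, the gap U has density l/(T-w) * (1 - u/(T-w))**(l-1)
    on (0, T-w). Expectations use midpoint quadrature; phi_{l-1} is linearly
    interpolated (and extrapolated just beyond the last w grid point).
    """
    w_grid = np.linspace(0.0, T, nw, endpoint=False)
    b_grid = np.linspace(0.0, 1.0, nb)
    x = (np.arange(nq) + 0.5) / nq                       # midpoints of (0, 1)
    r = x                                                # reward nodes, weight 1/nq each
    phis = [np.tile(b_grid, (nw, 1))]                    # phi_0(w, b) = b
    for l in range(1, L + 1):
        interp = RegularGridInterpolator((w_grid, b_grid), phis[-1],
                                         bounds_error=False, fill_value=None)
        phi = np.empty((nw, nb))
        for i, w in enumerate(w_grid):
            u = x * (T - w)                              # gap nodes on (0, T - w)
            # density of U at the nodes times the quadrature step (T - w)/nq
            wu = l / (T - w) * (1.0 - u / (T - w)) ** (l - 1) * (T - w) / nq
            best = np.maximum(b_grid[:, None], r[None, :])       # max{b, R}, shape (nb, nq)
            UU, BB = np.broadcast_arrays(u[:, None, None], best[None, :, :])
            nxt = interp(np.stack([w + UU, BB], axis=-1))        # phi_{l-1}(w + U, max{b, R})
            val = np.maximum(BB, nxt) - UU / eta                 # max{b, R, phi_{l-1}} - U/eta
            phi[i] = np.einsum("u,ubr->b", wu, val) / nq         # integrate over U, average over R
        phis.append(phi)
    return w_grid, b_grid, phis
```

For $\ell = 1$ the recursion reduces to $\phi_1(w, b) = \mathbb{E}[\max\{b, R\}] - \mathbb{E}[U]/\eta$, which under these assumptions equals $(1 + b^2)/2 - (T - w)/(2\eta)$; comparing the table against this closed form is a quick sanity check of the quadrature. The cost grows with every additional level $\ell$, which is the computational burden that motivates Section VI.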
We will need the following simple property of the threshold functions in a later section.

Lemma 2: For $\ell = 1, 2, \dots, K-1$, $-\eta\, \phi_\ell(w, b) < T - w - \eta b$.

Proof: See Appendix II-A. ■

Next we state the main lemma of this section. We call it the One-point Lemma, because it gives the optimal cost, $J_k(p_k, w, b)$, at stage $k$ when the belief state $p_k \in \mathcal{P}_k$ has all its mass on some $n \ge k$.

Lemma 3 (One-point): Fix some $n \in \{1, 2, \dots, K\}$ and $(w, b) \in (0, T) \times [0, \overline{R}]$. For any $k = 1, 2, \dots, n$, if $p_k \in \mathcal{P}_k$ is such that $p_k(n) = 1$, then

$$ J_k(p_k, w, b) = \min\big\{-\eta b,\ -\eta\, \phi_{n-k}(w, b)\big\}. $$

Proof: The proof is by induction. We make use of the fact that if at some stage $k < n$ the belief state $p_k$ is such that $p_k(n) = 1$, then the next belief state $p_{k+1}$ ($\in \mathcal{P}_{k+1}$), obtained by using the belief transition equation (9), also satisfies $p_{k+1}(n) = 1$. We complete the proof by using Definition 3 and the induction hypothesis. For a complete proof, see Appendix II-B. ■
Discussion of Lemma 3: At stage $k$, if the state is $(p_k, w, b)$, where $p_k$ is such that $p_k(n) = 1$ for some $n \ge k$, then from the One-point Lemma it follows that the optimal policy is to stop and transmit iff $b \ge \phi_{n-k}(w, b)$. The subscript $n - k$ of the function $\phi_{n-k}$ signifies the number of relays still to go. For instance, if we know that there are exactly 4 more relays to go, then the threshold to be used is $\phi_4$. Suppose that at stage $k$ it was optimal to continue; then from (9) it follows that the next belief state $p_{k+1} \in \mathcal{P}_{k+1}$ also has mass only on $\{N = n\}$, and hence at this stage it is optimal to use the threshold function $\phi_{n-(k+1)}$. Therefore, if we begin with an initial belief $p_0 \in \mathcal{P}_1$ such that $p_0(n) = 1$ for some $n$, then the optimal policy is to stop at the first stage $k$ such that $b \ge \phi_{n-k}(w, b)$, where $W_k = w$ is the wake-up instant of the $k$-th relay and $B_k = \max\{R_1, \dots, R_k\} = b$. Note that, since at stage $n$ the threshold to be used is $\phi_0(w, b) = b$ (see Definition 3), we invariably have to stop at stage $n$ if we have not terminated earlier. This is exactly our optimal policy in [1], where the number of relays is known to the source deterministically (instead of being known with probability 1, as in the One-point Lemma here). ■

V. Unknown N: Bounds on the optimum stopping set

In this section we consider the general case where the number of relays $N$ is not known to the source. The sequential decision problem developed in Section III was for this unknown-$N$ case. The problem was formulated as a POMDP in which the source's decision to stop and forward the packet is based on the belief state, which takes values in $\mathcal{P}_k$ after the source has observed $k$ relays waking up. We begin this section by defining the optimum stopping set, and we show that this set is convex. Characterizing the exact optimum stopping set is computationally intensive. Therefore, our aim is to derive inner and outer bounds (a subset and a superset, respectively) for the optimum stopping set.



Definition 4 (Optimum stopping set): For $1 \le k \le K-1$, let

$$ \mathcal{C}_k(w, b) := \big\{ p \in \mathcal{P}_k : -\eta b \le c_k(p, w, b) \big\}. $$

Referring to (16), it follows that, for a given $(w, b)$, $\mathcal{C}_k(w, b)$ is the set of all beliefs $p \in \mathcal{P}_k$ at stage $k$ at which it is optimal to stop. We call $\mathcal{C}_k(w, b)$ the optimum stopping set at stage $k$ when the delay ($W_k$) and best reward ($B_k$) values are $w$ and $b$, respectively. ■

A. Convexity of the Optimum Stopping Sets

We will prove (in Lemma 4) that the continuing cost, $c_k(p, w, b)$, in (15) is concave in $p \in \mathcal{P}_k$. Given the form of the stopping set $\mathcal{C}_k(w, b)$, a simple consequence of this lemma is that the optimum stopping set is convex. We further extend the concavity of $c_k(p, w, b)$ to $p \in \tilde{\mathcal{P}}_k$, where $\tilde{\mathcal{P}}_k$ is the affine set containing $\mathcal{P}_k$ (defined shortly in this section).

Lemma 4: For $k = 1, 2, \dots, K-1$ and any given $(w, b)$, the cost of continuing (defined in (15)), $c_k(\cdot, w, b)$, is concave on $\mathcal{P}_k$.

Proof: The essence of the proof is the same as that of [25, Lemma 1]. From (15) we easily see that $c_{K-1}(\cdot, w, b)$ is an affine function of $p \in \mathcal{P}_{K-1}$, and hence $J_{K-1}(\cdot, w, b)$ in (16), being the minimum of an affine function and a constant, is concave. The proof then follows by induction. The induction hypothesis is that for some stage $k+1$, $J_{k+1}(\cdot, w, b)$ is concave, so it can be expressed as an infimum over some collection of affine functions. The inductive step then shows that $c_k(\cdot, w, b)$ can also be expressed as an infimum over some collection of affine functions. Hence $c_k(\cdot, w, b)$ and (using (16)) $J_k(\cdot, w, b)$ are concave. A formal proof is given in Appendix III-A. ■

The following corollary is a straightforward application of the above lemma.

Corollary 1: For $k = 1, 2, \dots, K-1$ and any given $(w, b)$, $\mathcal{C}_k(w, b)$ ($\subseteq \mathcal{P}_k$) is a convex set.

Proof: From Lemma 4 we know that $c_k(p, w, b)$ is a concave function of $p \in \mathcal{P}_k$. Hence $\mathcal{C}_k(w, b)$ (see Definition 4), being a superlevel set of a concave function, is convex [29]. ■

In the next subsection, while proving an inner bound for the stopping set $\mathcal{C}_k(w, b)$, we will identify a set of points that could lie outside the probability simplex $\mathcal{P}_k$. We can obtain a better inner bound if we extend the concavity result to the affine set

$$ \tilde{\mathcal{P}}_k := \big\{ p \in \mathbb{R}^{K-k+1} : \langle p, \mathbf{1} \rangle = 1 \big\}, $$

where $\langle p, \mathbf{1} \rangle = \sum_{n=k}^{K} p(n)$, i.e., in $\tilde{\mathcal{P}}_k$ the vectors sum to one, but we do not require their components to be non-negative. This can be done as follows. Define $\tau_{k+1}(p, w, u)$ using (9) for every $p \in \tilde{\mathcal{P}}_k$. Then $\tau_{k+1}(\cdot, w, u)$, as a function of $p$, is the extension of $\tau_{k+1}(\cdot, w, u)$ from $\mathcal{P}_k$ to $\tilde{\mathcal{P}}_k$. Similarly, for every $p \in \tilde{\mathcal{P}}_k$, define $c_k(p, w, b)$ and $J_k(p, w, b)$ using (15) and (16). These are the extensions of $c_k(\cdot, w, b)$ and $J_k(\cdot, w, b)$, respectively. Then, using the same proof technique as in Lemma 4, we obtain the following corollary.

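Corollary 2: For $k = 1, 2, \dots, K-1$ and any given $(w, b)$, $c_k(\cdot, w, b)$ is concave on the affine set $\tilde{\mathcal{P}}_k$. ■

Using the above corollary, $\mathcal{C}_k(w, b)$ can be written as

$$ \mathcal{C}_k(w, b) = \mathcal{P}_k \cap \big\{ p \in \mathbb{R}^{K-k+1} : \langle p, \mathbf{1} \rangle = 1,\ -\eta b \le c_k(p, w, b) \big\}. \qquad (18) $$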

B. Inner Bound on the Optimum Stopping Set

We have shown that the optimum stopping set is convex. In this subsection we identify points that lie along certain edges of the simplex $\mathcal{P}_k$; the convex hull of these points yields an inner bound to the optimum stopping set. This first requires the following lemma, referred to as the Two-points Lemma, which generalizes the One-point Lemma (Lemma 3). It gives the optimal cost, $J_k(p, w, b)$, at stage $k$ when $p \in \mathcal{P}_k$ places all its mass on $k$ and on some $n > k$, i.e., $p(k) + p(n) = 1$. Throughout this and the next subsection (on an outer bound), $(W_k, B_k) = (w, b)$ is fixed, and hence, for ease of presentation, we drop $(w, b)$ from the notations $\delta_\ell(w, b)$, $a_k^\ell(w, b)$ and $b_k^\ell(w, b)$ (which appear later in these subsections). It is understood, however, that these thresholds are, in general, functions of $(w, b)$.

Lemma 5 (Two-points): For $k = 1, 2, \dots, K-1$, if $p \in \mathcal{P}_k$ is such that $p(k) + p(n) = 1$, where $k < n \le K$, then

$$ J_k(p, w, b) = \min\Big\{ -\eta b,\ p(k)\big(T - w - \eta b\big) + p(n)\big(-\eta\, \phi_{n-k}(w, b)\big) \Big\}. $$
Proof: Using (15) we can write

$$ c_k(p, w, b) = p(k)\big(T - w - \eta b\big) + p(n)\, \mathbb{E}\Big[ U_{k+1} + J_{k+1}\big(\tau_{k+1}(p, w, U_{k+1}),\ w + U_{k+1},\ \max\{b, R_{k+1}\}\big) \ \Big|\ w, n \Big]. $$

For $p$ as in the hypothesis, the belief in the next state satisfies $\tau_{k+1}(p, w, u)(n) = 1$. Using this observation, Lemma 3 (One-point) and the definition of $\phi_{n-k}$ in (17), we obtain the desired result. ■
Discussion of Lemma 5: The Two-points Lemma (Lemma 5) can be used to obtain certain threshold points in the following way. When $p \in \mathcal{P}_k$ has mass only on $k$ and on some $n$, $k < n \le K$, then, using Lemma 5 (and $p(k) = 1 - p(n)$), the continuing cost can be written as a function of $p(n)$ as

$$ c_k(p, w, b) = \big(T - w - \eta b\big) - p(n)\Big(T - w - \eta\big(b - \phi_{n-k}(w, b)\big)\Big). \qquad (19) $$

From Lemma 2 it follows that $c_k(p, w, b)$ in (19) is a decreasing function of $p(n)$, since Lemma 2 is exactly the statement that the coefficient $T - w - \eta\big(b - \phi_{n-k}(w, b)\big)$ is positive. Let $p_k^{(k)}$ and $p_k^{(n)}$ be the pmfs in $\mathcal{P}_k$ with mass only on $N = k$ and $N = n$, respectively. These are two of the corner points of the simplex $\mathcal{P}_k$. As an example, Fig. 2 illustrates the simplex and its corner points for stage $k = K-2$: with at most two more nodes to go, $\mathcal{P}_{K-2}$ is a two dimensional simplex in $\mathbb{R}^3$, and $p_{K-2}^{(K-2)}$, $p_{K-2}^{(K-1)}$ and $p_{K-2}^{(K)}$ are the corner points of this simplex.

Fig. 2. Probability simplex, $\mathcal{P}_{K-2}$, at stage $K-2$. A belief state at stage $K-2$ is a pmf on the points $K-2$, $K-1$ and $K$ (i.e., no-more, one-more and two-more relays to go, respectively). Thus $\mathcal{P}_{K-2}$ is a two dimensional simplex in $\mathbb{R}^3$.

At stage $k$, as we move along the line joining the points $p_k^{(k)}$ and $p_k^{(n)}$ (Fig. 3(a) and 3(b) illustrate this as $p(n)$ going from 0 to 1), the cost of continuing in (19) decreases, and there is a threshold below which it is optimal to transmit and beyond which it is optimal to continue. This threshold is the value of $p(n)$ in (19) at which the continuing cost becomes equal to $-\eta b$. Let $\delta_{n-k}$ denote this threshold value; then

$$ \delta_{n-k} = \frac{T - w}{T - w - \eta\big(b - \phi_{n-k}(w, b)\big)}. $$

The cost of continuing in (19) as a function of $p(n)$, along with the stopping cost $-\eta b$, is shown in Fig. 3(a) and 3(b). The threshold $\delta_{n-k}$ is the point of intersection of these two cost functions. The value of the continuing cost $c_k(p, w, b)$ at $p(n) = 1$ is $-\eta\, \phi_{n-k}(w, b)$. Note that when $b > \phi_{n-k}(w, b)$ the threshold $\delta_{n-k}$ is greater than 1, in which case it is optimal to stop for any $p$ on the line joining $p_k^{(k)}$ and $p_k^{(n)}$. ■




Fig. 3. Depiction of the thresholds $\delta_{n-k}(w, b)$. $c_k(p, w, b)$ in Equation (19) is plotted as a function of $p(n)$. Also shown is the constant function $-\eta b$, which is the stopping cost. $\delta_{n-k}$ is the point of intersection of these two functions. (a) When $b < \phi_{n-k}(w, b)$. (b) When $b > \phi_{n-k}(w, b)$.

Fig. 4. Depiction of the inner bound $\underline{\mathcal{C}}_{K-2}(w, b)$. In the examples in (a), (b) and (c) we only show the face of the simplex $\mathcal{P}_{K-2}$ of Fig. 2, with the inner bound shown as the shaded region. (a) When $\delta_1$ and $\delta_2$ are both less than 1. (b) When $\delta_1 > 1$ and $\delta_2 < 1$. (c) When $\delta_1 > 1$ and $\delta_2 > 1$.



There are similar thresholds along each edge of the simplex $\mathcal{P}_k$ starting from the corner point $p_k^{(k)}$. In general, let us define, for $\ell = 1, 2, \dots, K-1$,

$$ \delta_\ell = \frac{T - w}{T - w - \eta\big(b - \phi_\ell(w, b)\big)}. \qquad (20) $$

Remark: Note that (19) also holds for the extended function $c_k(p, w, b)$, where now $p \in \tilde{\mathcal{P}}_k$. In terms of the extended function, $\delta_{n-k}$ is the value of $p(n)$ (in (19), with $c_k$ now denoting the extended function) at which $c_k(p, w, b) = -\eta b$.
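The threshold (20) and the edge cost (19) are simple enough to check numerically. In the sketch below, `phi` stands for a precomputed value of $\phi_{n-k}(w, b)$ (supplied as a hypothetical number rather than obtained from (17)); the check confirms that (19) evaluated at $p(n) = \delta_{n-k}$ equals the stopping cost $-\eta b$, and that $\delta_{n-k} > 1$ when $b > \phi_{n-k}(w, b)$. The function names are hypothetical.

```python
def edge_continuing_cost(pn, T, w, b, eta, phi):
    """Continuing cost (19) on the edge p(k) + p(n) = 1, as a function of p(n);
    phi stands for the value phi_{n-k}(w, b)."""
    return (T - w - eta * b) - pn * (T - w - eta * (b - phi))

def delta_threshold(T, w, b, eta, phi):
    """Threshold (20): the p(n) at which (19) equals the stopping cost -eta*b.
    By Lemma 2 the denominator is positive, so the threshold is positive."""
    return (T - w) / (T - w - eta * (b - phi))

# Hypothetical values satisfying Lemma 2: -eta*phi = -1.0 < T - w - eta*b = -0.1.
d = delta_threshold(T=1.0, w=0.3, b=0.4, eta=2.0, phi=0.5)   # = 0.7 / 0.9
cost_at_d = edge_continuing_cost(d, T=1.0, w=0.3, b=0.4, eta=2.0, phi=0.5)
```

Here `cost_at_d` equals $-\eta b = -0.8$ up to rounding, and `edge_continuing_cost(1.0, ...)` returns $-\eta\,\phi = -1.0$, the value at the corner $p_k^{(n)}$.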

Recall that (from Lemma 5) the above discussion began with a $p \in \mathcal{P}_k$ such that $p(k) + p(n) = 1$. At the threshold of interest we have $p(n) = \delta_{n-k}$ and hence $p(k) = 1 - \delta_{n-k}$, with the rest of the components equal to zero. We denote this vector as $a_k^{n-k}$. For instance, in Fig. 4, where the face of the two dimensional simplex $\mathcal{P}_{K-2}$ is shown, the threshold along the lower edge of the simplex is $a_{K-2}^1 = [1 - \delta_1, \delta_1, 0]$ and that along the other edge is $a_{K-2}^2 = [1 - \delta_2, 0, \delta_2]$. Since it is possible that $\delta_{n-k} > 1$, the vector threshold $a_k^{n-k}$ is not restricted to lie in the simplex $\mathcal{P}_k$; however, it always stays in the affine set $\tilde{\mathcal{P}}_k$. We formally define these thresholds next.

Definition 5: For a given $k \in \{1, 2, \dots, K-1\}$ and for each $\ell = 1, 2, \dots, K-k$, define $a_k^\ell$ as the $(K-k+1)$-dimensional point whose first and $(\ell+1)$-th components are equal to $1 - \delta_\ell$ and $\delta_\ell$, respectively, and whose remaining components are zero. As mentioned before, $a_k^\ell$ lies on the line joining $p_k^{(k)}$ and $p_k^{(k+\ell)}$. At stage $k$ there are $K-k$ such points, one corresponding to each edge of $\mathcal{P}_k$ emanating from the corner point $p_k^{(k)}$. For an illustration of these points, see Fig. 4 for the case $k = K-2$. ■
Referring to Fig. 4(a) (which depicts the case $k = K-2$), suppose that all the vector thresholds $a_k^\ell$ lie within the simplex $\mathcal{P}_k$. Then, since at these points the stopping cost ($-\eta b$) equals the continuing cost ($c_k(a_k^\ell, w, b)$), all these points lie in the optimum stopping set $\mathcal{C}_k(w, b)$. Note that the corner point $p_k^{(k)}$ (the belief with all the mass on no-more relays to go) also lies in $\mathcal{C}_k(w, b)$. Since we have already shown that $\mathcal{C}_k(w, b)$ is convex, the convex hull of these points yields an inner bound. However, as mentioned earlier (and as depicted in Fig. 4(b) and 4(c)), it is possible for some or all of the thresholds $a_k^\ell$ to lie outside the simplex (and hence these thresholds do not belong to $\mathcal{C}_k(w, b)$). This is where we use Corollary 2, in which the concavity of the continuing cost, $c_k(p, w, b)$, is extended to the affine set $\tilde{\mathcal{P}}_k$. We next state the inner bound theorem:

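Theorem 1 (Inner bound): For $k = 1, 2, \dots, K-1$, recalling that $p_k^{(k)}$ is the pmf in $\mathcal{P}_k$ with point mass on $k$, define

$$ \underline{\mathcal{C}}_k(w, b) := \mathcal{P}_k \cap \operatorname{conv}\big\{p_k^{(k)}, a_k^1, \dots, a_k^{K-k}\big\}, $$

where conv denotes the convex hull of the given points. Then $\underline{\mathcal{C}}_k(w, b) \subseteq \mathcal{C}_k(w, b)$.

Proof: By the way the points $a_k^\ell$ are defined using $\delta_\ell$, it follows that $c_k(a_k^\ell, w, b) = -\eta b$ (see the Remark following (20)). $p_k^{(k)}$ is the pmf with point mass on $\{N = k\}$, so that $c_k(p_k^{(k)}, w, b) = T - w - \eta b > -\eta b$ (see (15)). Therefore the points $p_k^{(k)}, a_k^1, \dots, a_k^{K-k}$ belong to $\{p \in \mathbb{R}^{K-k+1} : \langle p, \mathbf{1} \rangle = 1,\ -\eta b \le c_k(p, w, b)\}$, which is a convex set (because $c_k(p, w, b)$ is concave in $p$, by Corollary 2). Therefore

$$ \operatorname{conv}\big\{p_k^{(k)}, a_k^1, \dots, a_k^{K-k}\big\} \subseteq \big\{p \in \mathbb{R}^{K-k+1} : \langle p, \mathbf{1} \rangle = 1,\ -\eta b \le c_k(p, w, b)\big\}, $$

and the result follows from (18). ■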
In Fig. 4, for stage $k = K-2$, we illustrate the various cases that can arise. In each of the figures the shaded region is the inner bound. In Fig. 4(a) all the thresholds lie within the simplex, and the convex hull of these points alone gives the inner bound. When some or all of the thresholds lie outside the simplex, as in Fig. 4(b) and 4(c), the inner bound is obtained by intersecting the convex hull of the thresholds with the simplex. In Fig. 4(c), where all the thresholds lie outside the simplex, the inner bound is the entire simplex $\mathcal{P}_{K-2}$, so that at stage $K-2$ with $(W_{K-2}, B_{K-2}) = (w, b)$ it is optimal to stop for any belief state.
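Because every vertex $a_k^\ell = (1 - \delta_\ell)\, p_k^{(k)} + \delta_\ell\, p_k^{(k+\ell)}$ lies on its own edge, membership in the convex hull of Theorem 1 has a closed form that we derive here (it is not stated in the text): solving for the convex weights gives $\lambda_\ell = p(k+\ell)/\delta_\ell$ and $\lambda_0 = 1 - \sum_\ell \lambda_\ell$, so a pmf $p \in \mathcal{P}_k$ lies in $\underline{\mathcal{C}}_k(w, b)$ iff $\sum_{\ell=1}^{K-k} p(k+\ell)/\delta_\ell \le 1$. This uses $\delta_\ell > 0$, which holds since $T - w > 0$ and, by Lemma 2, the denominator of (20) is positive. The sketch below takes the $\delta_\ell$ as hypothetical inputs; the function name is hypothetical.

```python
import numpy as np

def in_inner_bound(p, deltas):
    """Test whether a pmf p on {k, ..., K} lies in conv{p^(k), a_k^1, ..., a_k^{K-k}}.

    p[0] = p(k) and p[l] = p(k + l); deltas[l - 1] = delta_l > 0 from (20).
    The convex weights are lambda_l = p(k + l)/delta_l and lambda_0 = 1 - sum(lambda_l),
    so membership holds iff sum_l p(k + l)/delta_l <= 1.
    """
    p = np.asarray(p, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    return float(np.sum(p[1:] / deltas)) <= 1.0 + 1e-12   # small tolerance for rounding
```

When every $\delta_\ell > 1$ (the case of Fig. 4(c)), $\sum_\ell p(k+\ell)/\delta_\ell < \sum_\ell p(k+\ell) \le 1$ for every pmf, so the test accepts the whole simplex, matching the discussion above.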

C. Outer Bound on the Optimum Stopping Set

In this subsection we obtain an outer bound (a superset) for the optimum stopping set. As in the case of the inner bound, we identify certain threshold points whose convex hull contains the optimum stopping set. This first requires a monotonicity result that compares the cost of continuing at two belief states $p, q \in \mathcal{P}_k$ that are ordered as in Fig. 5 (shown there for $k = K-2$). The point $q$ in Fig. 5 is such that $q(K-2) = p(K-2)$ (i.e., the probability that there are no more relays to go is the same under $p$ and $q$) and $q(K-1) = 1 - p(K-2)$ (i.e., all the remaining probability in $q$ is on the event that there is one more relay to go, while in $p$ it can be on one more or two more relays to go). Thus $q$ lies on the lower edge of the simplex. We will show that the cost of continuing at $p$ is no larger than that at $q$.

Lemma 6: Given $p \in \mathcal{P}_k$ for $k = 1, 2, \dots, K-1$, define $q \in \mathcal{P}_k$ by $q(k) = p(k)$ and $q(k+1) = 1 - p(k)$ (so that $q$ puts no mass on $k+2, \dots, K$). Then $c_k(p, w, b) \le c_k(q, w, b)$ for any $(w, b)$.

Proof: See Appendix III-B. ■

Discussion of Lemma 6: This lemma proves the intuitive result that the continuing cost under a pmf $p$ that puts mass on larger numbers of relays should be smaller than under a pmf $q$ that concentrates all such mass of $p$ on just one more relay to go: with more relays to come, the cost of continuing is expected to decrease. ■
Similar to the thresholds $a_k^\ell$, we define thresholds $b_k^\ell$ that lie along certain edges of the simplex. We first identify the threshold $a_k^{\ell_{\max}}$ that is at maximum distance from the corner point $p_k^{(k)}$ (in Fig. 5 this point is $a_{K-2}^1 = [1 - \delta_1, \delta_1, 0]$). We then define the thresholds $b_k^\ell$ to be the points on the edges emanating from $p_k^{(k)}$ that are at this same distance. Thus in Fig. 5, $b_{K-2}^1 = a_{K-2}^1$ and $b_{K-2}^2 = [1 - \delta_1, 0, \delta_1]$.




Fig. 5. The light shaded region is the inner bound. The outer bound is the union of the light and the dark shaded regions.



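Definition 6: Let $\ell_{\max} := \arg\max_{\ell \in \{1, \dots, K-k\}} \delta_\ell$. For $\ell = 1, 2, \dots, K-k$, define $b_k^\ell$ as the $(K-k+1)$-dimensional point whose first and $(\ell+1)$-th components are equal to $1 - \delta_{\ell_{\max}}$ and $\delta_{\ell_{\max}}$, respectively, and whose remaining components are zero. The points $b_k^\ell$ are all at equal distance from $p_k^{(k)}$, each on a different edge starting from $p_k^{(k)}$. ■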

Using Lemma 6, we show that the convex hull of the thresholds $b_k^\ell$ together with the corner point $p_k^{(k)}$ constitutes an outer bound for the optimum stopping set. The idea of the proof can be illustrated using Fig. 5. The point $p$ in Fig. 5 is outside the convex hull, and $q$ is obtained from $p$ as in Lemma 6. At $q$ it is optimal to continue, since $q$ is beyond the threshold $a_{K-2}^1$; hence the continuing cost at $q$, $c_k(q, w, b)$, is less than the stopping cost $-\eta b$. From Lemma 6 it follows that the continuing cost at $p$, $c_k(p, w, b)$, is also less than $-\eta b$, so that it is optimal to continue at $p$ as well, proving that $p$ does not belong to the optimum stopping set. Thus the convex hull contains the optimum stopping set. We formally state and prove this outer bound theorem next.

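Theorem 2 (Outer bound): For $k = 1, 2, \dots, K-1$, define

$$ \overline{\mathcal{C}}_k(w, b) := \mathcal{P}_k \cap \operatorname{conv}\big\{p_k^{(k)}, b_k^1, \dots, b_k^{K-k}\big\}. $$

Then $\mathcal{C}_k(w, b) \subseteq \overline{\mathcal{C}}_k(w, b)$.

Proof: Let $\ell_{\max} = \arg\max_{\ell = 1, 2, \dots, K-k} \delta_\ell$. If $\delta_{\ell_{\max}} \ge 1$, then $\overline{\mathcal{C}}_k(w, b) = \mathcal{P}_k$ ($\supseteq \mathcal{C}_k(w, b)$) and the result trivially follows. Hence, consider the case $\delta_{\ell_{\max}} < 1$. Pick any $p \notin \overline{\mathcal{C}}_k(w, b)$; we will show that $p \notin \mathcal{C}_k(w, b)$. Let $q \in \mathcal{P}_k$ be such that $q(k) = p(k)$ and $q(k+1) = 1 - p(k)$.

$p \notin \overline{\mathcal{C}}_k(w, b)$ implies that $p(k) < 1 - \delta_{\ell_{\max}}$. Since $q(k+1) = 1 - p(k) > \delta_{\ell_{\max}} \ge \delta_1$, it follows that under $q$ it is optimal to continue, so that $q \notin \mathcal{C}_k(w, b)$, i.e., $c_k(q, w, b) < -\eta b$. Finally, by applying Lemma 6 we can write $c_k(p, w, b) \le c_k(q, w, b) < -\eta b$. This means that at $p$ it is optimal to continue, so that $p \notin \mathcal{C}_k(w, b)$. ■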

The outer bound for $k = K-2$ is illustrated in Fig. 5. The light shaded region is the inner bound. The outer bound is the union of the light and the dark shaded regions. The boundary of the optimum stopping set falls within the dark shaded region. For any $p$ within the inner bound we know that it is optimal to stop, and for any $p$ outside the outer bound it is optimal to continue. We are uncertain about the optimal action for belief states within the dark shaded region.
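Putting the two theorems together gives a three-way classification of any belief at a fixed $(w, b)$. The sketch below uses the closed forms we derive from the vertex structure (not stated in the text): $p$ is in the inner bound iff $\sum_\ell p(k+\ell)/\delta_\ell \le 1$, and in the outer bound iff $p(k) \ge 1 - \max_\ell \delta_\ell$, which is exactly the condition used in the proof of Theorem 2. The $\delta_\ell$ values are hypothetical inputs and the function name is hypothetical.

```python
import numpy as np

def classify_belief(p, deltas):
    """Classify a pmf p on {k, ..., K} (p[0] = p(k)) using both bounds at fixed (w, b).

    'stop'      : p in the inner bound (Theorem 1): sum_l p(k + l)/delta_l <= 1
    'continue'  : p outside the outer bound (Theorem 2): p(k) < 1 - max_l delta_l
    'uncertain' : p in the gap between the bounds (dark region of Fig. 5)
    """
    p = np.asarray(p, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    if np.sum(p[1:] / deltas) <= 1.0:
        return "stop"
    if p[0] < 1.0 - deltas.max():
        return "continue"
    return "uncertain"
```

For example, with $\delta_1 = 0.5$ and $\delta_2 = 0.3$ at stage $K-2$, the belief $(0.2, 0.4, 0.4)$ is classified as "continue", while $(0.6, 0.1, 0.3)$ falls in the gap and is "uncertain". Since $\delta_\ell \le \delta_{\ell_{\max}}$, every belief passing the inner test also satisfies the outer condition, so the three labels are consistent.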

VI. Optimum Relay Selection in a Simplified Model

The bounds obtained in the previous section require us to compute the threshold functions $\{\phi_\ell : \ell = 0, 1, \dots, K-1\}$ (see Definition 3) recursively, and these are computationally very intensive to obtain. Hence, in this section we simplify the exact model and extract a simple selection rule. Our aim is to apply this simple rule to the exact model and compare its performance with that of the other policies.

A. The Simplified Model

We now describe our simplified model. There are $\tilde{N}$ relays, where $\tilde{N}$ is a constant known to the source. The key simplification in this model is that the relay nodes wake up at the first $\tilde{N}$ points of a Poisson process of rate $\tilde{N}/T$. The motivations for this simplification are as follows. In our actual model, when $N = \tilde{N}$, the inter wake-up times $\{U_k : 1 \le k \le \tilde{N}\}$ are identically distributed [27, Chapter 2], but not independent. Their common cdf (cumulative distribution function) is $F_{U_k|N}(u \mid \tilde{N}) = 1 - (1 - u/T)^{\tilde{N}}$ for $u \in (0, T)$. From Fig. 6 we observe that the cdf of $\{U_k : 1 \le k \le \tilde{N}\}$ is close to that of an exponential random variable with parameter $\tilde{N}/T$, and the approximation improves for larger values of $\tilde{N}$ (for a fixed $T$). This motivates us to approximate the actual inter wake-up times by exponential random variables of rate $\tilde{N}/T$. Further, in the simplified model we allow the inter wake-up times to be independent. Finally, observe that in the simplified model the average number of relays that wake up within the duty cycle $T$ is $\tilde{N}$, which is the same as in the exact model when $N = \tilde{N}$.




Fig. 6. The cdfs $F_{U_k|N}(\cdot \mid \tilde{N})$ and the cdf of $Y$, where $Y \sim \text{Exponential}(\tilde{N}/T)$ (with $T = 1$), plotted against $u$ for (a) $\tilde{N} = 5$ and (b) $\tilde{N} = 15$.
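The comparison in Fig. 6 can be reproduced numerically. The sketch below evaluates both cdfs on a fine grid of $(0, T)$ and reports the largest absolute gap between the exact inter wake-up cdf $1 - (1 - u/T)^{\tilde{N}}$ and the Exponential($\tilde{N}/T$) cdf $1 - e^{-\tilde{N}u/T}$; the function name and grid size are hypothetical choices.

```python
import numpy as np

def max_cdf_gap(n, T=1.0, grid=10_001):
    """Largest gap on [0, T] between the exact gap cdf 1 - (1 - u/T)**n
    and the Exponential(n/T) cdf 1 - exp(-n*u/T)."""
    u = np.linspace(0.0, T, grid)
    exact = 1.0 - (1.0 - u / T) ** n
    expo = 1.0 - np.exp(-n * u / T)
    return float(np.max(np.abs(exact - expo)))

for n in (5, 15):                      # the two values of N-tilde used in Fig. 6
    print(n, round(max_cdf_gap(n), 4))
```

The gap shrinks as $\tilde{N}$ grows, which is the quantitative content behind the statement that the exponential approximation improves for large $\tilde{N}$.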

We will use notations such as $W_k$, $R_k$, $U_k$, etc., to represent the quantities analogous to those defined for the exact model. For instance, $W_k$ represents the wake-up time of the $k$-th relay. However, unlike in the exact model, here $W_k$ can be beyond $T$. As mentioned before, $\{U_k : k = 1, \dots, \tilde{N}\}$ are simply iid exponential random variables with parameter $\tilde{N}/T$. $\{R_k : k = 1, \dots, \tilde{N}\}$ are iid random rewards with common pdf $f_R$, which is the same as in the exact model.



B. MDP Formulation

Again, the decision instants are the times at which the relays wake up. At some stage $k$, $1 \le k < \tilde{N}$, suppose $(W_k, B_k) = (w, b)$; then the one step cost of stopping is $-\eta b$ and that of continuing is $U_{k+1}$. Since $U_{k+1} \sim \text{Exp}(\tilde{N}/T)$, the one step costs do not depend on $w$, which means that the optimal policy for the simplified model does not depend on the value of $w$. Also, since the number of relays is a constant, we do not retain it as a part of the state, unlike in the actual state space (Equation (6)). Therefore we simplify the state space to $\mathcal{S}_0 = \{0\}$ and, for $k = 1, 2, \dots, \tilde{N}$,

$$ \mathcal{S}_k = [0, \overline{R}] \cup \{\psi\}. $$

As before, $\psi$ is the terminating state. Suppose that at some stage $1 \le k < \tilde{N}$ the state is $B_k = b$; then the next state $S_{k+1}$ is

$$ S_{k+1} = \begin{cases} \psi & \text{if } a_k = 1, \\ \max\{b, R_{k+1}\} & \text{if } a_k = 0. \end{cases} $$

We mentioned the one step costs earlier; for completeness, we write them down here:

$$ g_k(b, a_k) = \begin{cases} -\eta b & \text{if } a_k = 1, \\ U_{k+1} & \text{if } a_k = 0. \end{cases} $$

The cost of termination is simply $g_{\tilde{N}}(b) = -\eta b$.
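In this model the expected cost of any rule of the form "forward to the first relay whose reward is at least $t$; if none qualifies, forward to the best relay at stage $\tilde{N}$" has a closed form, which makes it easy to scan candidate thresholds. The sketch assumes, for illustration only, that $R \sim$ Uniform(0, 1). By Wald's identity the expected delay is $(T/\tilde{N})\,\mathbb{E}[\tau]$, where $\tau$ is the stopping stage (the wait $U_1$ for the first relay included) and $\mathbb{E}[\tau] = \sum_{k=1}^{\tilde{N}} t^{k-1}$; the chosen reward is uniform on $[t, 1]$ if some reward reaches $t$ (probability $1 - t^{\tilde{N}}$), and otherwise is the maximum of $\tilde{N}$ rewards uniform on $[0, t)$, with mean $t\tilde{N}/(\tilde{N}+1)$. The function name is hypothetical.

```python
import numpy as np

def threshold_rule_cost(t, T, eta, n):
    """Expected cost E[D] - eta * E[R] of the rule 'forward to the first relay with
    reward >= t; if none, forward to the best of all n relays at stage n'.

    Assumes (for illustration) R ~ Uniform(0, 1) and iid Exp(n/T) gaps, so each
    stage adds T/n of expected delay (the wait for the first relay included).
    """
    stages = (1.0 - t ** n) / (1.0 - t) if t < 1.0 else float(n)  # E[stopping stage]
    # Some reward reaches t (prob 1 - t**n): the chosen reward is uniform on [t, 1].
    # None does (prob t**n): the best of n rewards uniform on [0, t) has mean t*n/(n+1).
    reward = (1.0 - t ** n) * (1.0 + t) / 2.0 + t ** n * t * n / (n + 1)
    return (T / n) * stages - eta * reward

ts = np.linspace(0.0, 0.999, 1000)
costs = [threshold_rule_cost(t, T=1.0, eta=2.0, n=5) for t in ts]
t_best = ts[int(np.argmin(costs))]     # best threshold on the grid
```

The numerically best threshold `t_best` can then be compared with the threshold $\alpha$ derived analytically in Section VI-C.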

C. Optimal Policy via One-Step-Stopping Set

In this section we prove that the one-step-look-ahead rule is optimal for the simplified model. The idea is to show that the one-step-stopping set is absorbing [28, Section 4.4]. These terms are defined below. For an alternative derivation of the optimal policy by value iteration, see the next section (Section VI-D).

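At stage $k$, $1 \le k < \tilde{N}$, when the state is $b$, the cost of stopping is simply $C_s(b) = -\eta b$. The cost of continuing for one more step (which is $U_{k+1}$) and then stopping at the next stage (where the state is $\max\{b, R_{k+1}\}$) is

$$ C_c(b) = \mathbb{E}\big[U_{k+1} - \eta \max\{b, R_{k+1}\}\big] = -\eta\Big(\mathbb{E}\big[\max\{b, R\}\big] - \frac{T}{\eta \tilde{N}}\Big), $$

where we used $\mathbb{E}[U_{k+1}] = T/\tilde{N}$. By defining the function $\beta(\cdot)$ for $b \in [0, \overline{R}]$ as

$$ \beta(b) = \mathbb{E}\big[\max\{b, R\}\big] - \frac{T}{\eta \tilde{N}}, $$

we can write $C_c(b) = -\eta\, \beta(b)$. Note that neither of the costs $C_s$ and $C_c$ depends on the stage index $k$.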
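To make $\beta$ concrete, the sketch below assumes $R \sim$ Uniform(0, 1), for which $\mathbb{E}[\max\{b, R\}] = (1 + b^2)/2$ and $\overline{R} = 1$. It evaluates $\beta$ and locates, by bisection, the point where $\beta(b)$ crosses $b$ (returning 0 when $\beta(0) \le 0$); the discussion of Lemma 7 below identifies this crossing as the threshold $\alpha$. Bisection is valid because $\beta(b) - b$ is non-increasing (its derivative is $\mathbb{P}(R \le b) - 1 \le 0$). The function names are hypothetical.

```python
def beta(b, T, eta, n):
    """beta(b) = E[max{b, R}] - T/(eta*n), assuming R ~ Uniform(0, 1),
    for which E[max{b, R}] = (1 + b**2)/2 and R-bar = 1."""
    return (1.0 + b ** 2) / 2.0 - T / (eta * n)

def alpha_threshold(T, eta, n, tol=1e-12):
    """Return the b in [0, 1] where beta(b) = b if beta(0) > 0, else 0.
    beta(b) - b is non-increasing in b, so bisection brackets the crossing."""
    if beta(0.0, T, eta, n) <= 0.0:
        return 0.0
    lo, hi = 0.0, 1.0          # beta(lo) > lo, and beta(hi) = 1 - T/(eta*n) < hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta(mid, T, eta, n) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(alpha_threshold(T=1.0, eta=2.0, n=5))   # compare with 1 - sqrt(2*T/(eta*n))
```

Under the uniform-reward assumption the crossing solves $b = (1 + b^2)/2 - T/(\eta \tilde{N})$, giving $1 - \sqrt{2T/(\eta \tilde{N})}$ whenever $T/(\eta \tilde{N}) < 1/2$; for $T = 1$, $\eta = 2$, $\tilde{N} = 5$ this is $1 - \sqrt{0.2} \approx 0.553$.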
Definition 7: We define the one-step-stopping set as

$$\mathcal{C}_{1step} = \big\{ b \in [0, R] : -\eta b \le -\eta\beta(b) \big\}, \tag{22}$$

i.e., it is the set of all states $b \in [0, R]$ where the cost of stopping, $c_s(b)$, does not exceed the cost of continuing for one more step and then stopping at the next stage, $c_c(b)$. ■

We will show that $\mathcal{C}_{1step}$ is characterized by a threshold $\alpha$ and can be written as $\mathcal{C}_{1step} = [\alpha, R]$. This will require the following properties of $\beta(\cdot)$.

Lemma 7:

1) $\beta$ is continuous, increasing and convex in $b$.

2) If $\beta(0) \le 0$, then $\beta(b) \le b$ for all $b \in [0, R]$.

3) If $\beta(0) > 0$, then there exists a unique $\alpha$ such that $\alpha = \beta(\alpha)$.

4) If $\beta(0) > 0$, then $\beta(b) < b$ for $b \in (\alpha, R]$ and $\beta(b) > b$ for $b \in [0, \alpha)$.

Proof: See Appendix III-A. ■

Discussion of Lemma 7: When $\beta(0) > 0$, using Lemma 7.3 and 7.4 we can write $\mathcal{C}_{1step}$ in (22) as $\mathcal{C}_{1step} = [\alpha, R]$. For the other case, where $\beta(0) \le 0$, it follows from Lemma 7.2 that $\mathcal{C}_{1step} = [0, R]$. Thus, by defining $\alpha = 0$ whenever $\beta(0) \le 0$, we can write $\mathcal{C}_{1step} = [\alpha, R]$ in either case. ■

Definition 8: Depending on the value of $\beta(0)$, define $\alpha$ as follows:

$$\alpha = \begin{cases} \text{the unique solution of } \alpha = \beta(\alpha) & \text{if } \beta(0) > 0, \\ 0 & \text{if } \beta(0) \le 0. \end{cases} \tag{23}$$

Definition 9: A policy is said to be one-step-look-ahead if at stage $k$, $1 \le k < N$, it stops iff the state $b \in \mathcal{C}_{1step}$, i.e., iff the cost of stopping, $c_s(b)$, does not exceed the cost of continuing for one more step and then stopping, $c_c(b)$. ■

Definition 10: Let $\mathcal{C}$ be some subset of the state space $[0, R]$, i.e., $\mathcal{C} \subseteq [0, R]$. We say that $\mathcal{C}$ is absorbing if, for every $b \in \mathcal{C}$, whenever the action at stage $k$, $1 \le k < N$, is to continue, the next state $B_{k+1}$ at stage $k+1$ also falls into $\mathcal{C}$. ■

Since we have expressed $\mathcal{C}_{1step}$ as $[\alpha, R]$ and since $B_{k+1} = \max\{b, R_{k+1}\} \ge b$, it is clear that $\mathcal{C}_{1step}$ is absorbing. Finally, referring to [28, Section 4.4], for optimal stopping problems, whenever the one-step-stopping set is absorbing, the one-step-look-ahead rule is optimal. Thus the optimal policy for the simplified model is to choose the first relay whose reward is at least $\alpha$. If none of the relays' reward values is at least $\alpha$, then at the last stage, $N$, choose the one with the maximum reward.

D. Optimal Policy via Value Iteration

In this section we provide an alternative derivation of the optimal policy (already obtained in the previous section). We write down the value functions starting from the last stage $N$, proceed backwards, and then simplify to obtain the optimal policy.
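The value function for the last stage $N$ is simply $J_N(b) = g_N(b) = -\eta b$. Next, at stage $N-1$,

$$\begin{aligned} J_{N-1}(b) &= \min\Big\{-\eta b,\ \mathbb{E}\big[U_N + J_N(\max\{b, R_N\})\big]\Big\} \\ &= \min\Big\{-\eta b,\ \mathbb{E}\big[U_N - \eta\max\{b, R_N\}\big]\Big\} \\ &= \min\Big\{-\eta b,\ -\eta\Big(\mathbb{E}[\max\{b, R\}] - \frac{\mathbb{E}[U]}{\eta}\Big)\Big\} \\ &= \min\big\{-\eta b,\ -\eta\beta_1(b)\big\}, \end{aligned} \tag{24}$$

where the function $\beta_1(\cdot)$ is exactly the same as the function $\beta(\cdot)$ in (21), which we reproduce here for convenience:

$$\beta_1(b) = \mathbb{E}[\max\{b, R\}] - \frac{\mathbb{E}[U]}{\eta}.$$

$\beta_1$ satisfies the properties listed in Lemma 7.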

From (24) it is clear that at stage $N-1$ the optimal policy is to stop iff $-\eta b \le -\eta\beta_1(b)$, i.e., iff $b \ge \beta_1(b)$. Whenever $\beta_1(0) \le 0$, from Lemma 7.2 and (24) we observe that at stage $N-1$ it is optimal to stop for any $b \in [0, R]$. On the other hand, when $\beta_1(0) > 0$, from Lemma 7.3, 7.4 and (24) we can conclude that it is optimal to stop iff $b \ge \alpha$. A plot of the function $\beta_1$ for the case $\beta_1(0) > 0$ is shown in Fig. 7. It will follow that there is a similar function at each stage. Formally, at stage $k$ there is a function $\beta_{N-k}(\cdot)$ such that at stage $k$ it is optimal to stop iff $b \ge \beta_{N-k}(b)$. Further, $\beta_{N-k}(\cdot)$ satisfies $\beta_{N-k}(b) \ge \beta_1(b)$ for $b < \alpha$ and $\beta_{N-k}(b) = \beta_1(b)$ for $b \ge \alpha$. This property of the $\beta$ functions is illustrated in Fig. 7 for stages $N-2$ and $N-3$. Thus the optimal policy at any other stage $k = 1, 2, \cdots, N-2$ is the same as the above-mentioned $\alpha$-threshold policy.




Fig. 7. Simplified Model: Illustration of the successive $\beta$ functions. The threshold $\alpha$ is the point of intersection of $\beta_1(b)$ with the linear function $b$. In the figure, $\alpha = 0.6$. For $b < \alpha$, $\beta_\ell(b)$ is non-decreasing in $\ell$; for $b \ge \alpha$, $\beta_\ell(b) = \beta_1(b)$.

First we extend the definition of $\alpha$ to the case $\beta_1(0) \le 0$ by defining $\alpha = 0$ (which is the same as the definition of $\alpha$ in (23) in the previous section).
Definition 11: Depending on the value of $\beta_1(0)$, define $\alpha$ as follows:

$$\alpha = \begin{cases} \text{the unique solution of } \alpha = \beta_1(\alpha) & \text{if } \beta_1(0) > 0, \\ 0 & \text{if } \beta_1(0) \le 0. \end{cases}$$

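Lemma 8: For every $k \in \{1, 2, \cdots, N-1\}$ the following holds:

$$J_k(b) = \min\big\{-\eta b,\ -\eta\beta_{N-k}(b)\big\}, \tag{25}$$

where $\beta_1(b)$ is as defined in (21) and, for $k = 1, 2, \cdots, N-2$,

$$\beta_{N-k}(b) = \mathbb{E}\Big[\max\big\{b, R, \beta_{N-(k+1)}\big(\max\{b, R\}\big)\big\}\Big] - \frac{\mathbb{E}[U]}{\eta}, \tag{26}$$

which has the property $\beta_{N-k}(b) \ge \beta_{N-(k+1)}(b)$ for any $b \in [0, R]$. In particular, if $b \ge \alpha$ then $\beta_{N-k}(b) = \beta_1(b)$.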

Proof: Here we provide only an outline of the proof; for a complete proof, see Appendix III-B. The result already holds for $k = N-1$ (see (24) and (21)). Next we prove the result for $k \le N-2$; the proof is by induction. Suppose that for some $k$, $2 \le k \le N-1$, (25) and (26) hold along with the ordering property mentioned in the lemma. We write down the value function $J_{k-1}$ in terms of $J_k$, and straightforward manipulation yields (25) and (26) for $k-1$. The ordering result for $k-1$ can also be easily obtained by using the ordering result for $k$. In Fig. 7 we have depicted this ordering behaviour of the $\beta_\ell$ functions. ■
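The ordering in Lemma 8 can be checked numerically by iterating (21) and (26). The sketch below is a hypothetical illustration, not the authors' code: it assumes a reward that is uniform on an equally spaced grid over $[0, 1]$ (so that $\max\{b, R\}$ always falls on a grid point and $\beta_{\ell-1}(\max\{b, R\})$ becomes a table lookup) and a given constant $c = \mathbb{E}[U]/\eta$; `beta_sequence` is our own name.

```python
# Illustrative sketch of the recursion (26) for a discrete reward.
# Assumptions (not from the paper): R is uniform on an equally spaced grid
# over [0, 1], and c = E[U]/eta is a known constant.


def beta_sequence(grid, probs, c, num_stages):
    """Return [beta_1, ..., beta_L] evaluated on `grid`, with L = num_stages.

    R takes the value grid[j] with probability probs[j]; grid must be sorted so
    that max{grid[i], grid[j]} = grid[max(i, j)].
    """
    n = len(grid)
    # beta_1(b) = E[max{b, R}] - c, equation (21)
    beta_1 = [sum(p * max(grid[i], grid[j]) for j, p in enumerate(probs)) - c
              for i in range(n)]
    betas = [beta_1]
    for _ in range(1, num_stages):
        prev = betas[-1]
        # beta_l(b) = E[max{b, R, beta_{l-1}(max{b, R})}] - c, equation (26)
        cur = [sum(p * max(grid[i], grid[j], prev[max(i, j)])
                   for j, p in enumerate(probs)) - c
               for i in range(n)]
        betas.append(cur)
    return betas


if __name__ == "__main__":
    grid = [i / 100 for i in range(101)]
    probs = [1 / 101] * 101
    betas = beta_sequence(grid, probs, c=0.1, num_stages=4)
    # Stop at b iff b >= beta_l(b); by Theorem 3 this set should not depend on l.
    for l, bl in enumerate(betas, start=1):
        print(l, min(b for b, v in zip(grid, bl) if b >= v))
```

On such a grid the printed smallest stopping state should be the same for every stage, which is the $\alpha$-threshold structure that Theorem 3 below establishes.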
The following main theorem is a simple consequence of Lemma 7 and Lemma 8.

Theorem 3: At any stage $k = 1, 2, \cdots, N-1$, the optimal policy for the simplified model is to stop iff $B_k = b \ge \alpha$.

Proof: From (25) in Lemma 8 it follows that the optimal policy is to stop iff $-\eta b \le -\eta\beta_{N-k}(b)$, i.e., iff $b \ge \beta_{N-k}(b)$. If $b \ge \alpha$, then from Lemma 7 and Lemma 8 we have $b \ge \beta_1(b) = \beta_{N-k}(b)$; hence it is optimal to stop (see Fig. 7 for an illustration). On the other hand, if $b < \alpha$, then (again from Lemma 7.4 and Lemma 8) we have $b < \beta_1(b) \le \beta_{N-k}(b)$, and hence the optimal action is to continue. ■

Thus the optimal policy for the simplified model is simply to select the first relay with a reward of at least $\alpha$. If all the relays have a reward of less than $\alpha$, then at the last stage $N$, choose the one with the best reward.
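Given $\beta_1$, the threshold $\alpha$ of Definition 11 (equivalently (23)) can be found by bisection, since by Lemma 7 the function $\beta_1(b) - b$ changes sign exactly once when $\beta_1(0) > 0$. The Python sketch below assumes $\mathbb{E}[U] > 0$ (so that $\beta_1(R) = R - \mathbb{E}[U]/\eta < R$) and, in its usage lines only, a hypothetical closed form for $\beta_1$ obtained with $R$ uniform on $[0, 1]$, for which $\mathbb{E}[\max\{b, R\}] = (1 + b^2)/2$. The names `alpha_threshold` and `select_relay` are illustrative.

```python
import math

# Sketch of Definition 11 / (23) and of the resulting alpha-threshold rule.
# Assumption (ours): E[U] > 0, so that beta_1(r_max) - r_max = -E[U]/eta < 0.


def alpha_threshold(beta1, r_max, tol=1e-10):
    """alpha = 0 if beta1(0) <= 0, else the unique root of beta1(a) = a in [0, r_max]."""
    if beta1(0.0) <= 0.0:
        return 0.0
    lo, hi = 0.0, r_max   # beta1(lo) > lo; beta1(hi) <= hi by the E[U] > 0 assumption
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta1(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def select_relay(rewards, alpha):
    """Index of the first relay (wake-up order) with reward >= alpha, else of the best one."""
    for i, r in enumerate(rewards):
        if r >= alpha:
            return i
    return max(range(len(rewards)), key=lambda i: rewards[i])


if __name__ == "__main__":
    c = 0.1                                      # hypothetical E[U]/eta
    beta1 = lambda b: (1.0 + b * b) / 2.0 - c    # R ~ Uniform[0, 1]
    alpha = alpha_threshold(beta1, 1.0)
    print(round(alpha, 4), round(1 - math.sqrt(0.2), 4))  # prints 0.5528 0.5528
    print(select_relay([0.3, 0.7, 0.9], alpha))          # prints 1
```

With $\mathbb{E}[U]/\eta = 0.1$ the fixed-point equation becomes $\alpha^2 - 2\alpha + 0.8 = 0$, whose root in $[0, 1]$ is $1 - \sqrt{0.2}$, which the bisection reproduces.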

E. Analysis of the $\alpha$-Threshold Policies

We have thus seen that the optimal policy for the simplified model is characterized by a threshold $\alpha$. Let $R_\alpha$ represent the reward obtained when the threshold used is $\alpha$; $R_\alpha$ is the reward value of the relay to which the packet is finally forwarded. We are interested in obtaining an expression for $\mathbb{E}[R_\alpha]$ (this will be useful later in Section VII-B). Since $R_\alpha$ takes values in $[0, R]$, $\mathbb{E}[R_\alpha]$ can be written as

$$\mathbb{E}[R_\alpha] = \int_0^{R} \mathbb{P}(R_\alpha > r)\,dr, \tag{27}$$

which requires us to obtain $\mathbb{P}(R_\alpha > r)$ for $r \in [0, R]$. Let us consider two cases, $r \in [0, \alpha]$ and $r \in (\alpha, R]$.
For $r \in [0, \alpha]$, the reward satisfies $R_\alpha > r$ whenever there is at least one relay with a reward value of more than $r$. Therefore, for $r \in [0, \alpha]$,

$$\begin{aligned} \mathbb{P}(R_\alpha > r) &= \mathbb{P}(\max\{R_1, \cdots, R_N\} > r) \\ &= 1 - \mathbb{P}(\max\{R_1, \cdots, R_N\} \le r) \\ &= 1 - F_R(r)^N. \end{aligned} \tag{28}$$

The third equality holds because the $R_i$'s are iid with $F_R$ being their common cdf.

Now for $r \in (\alpha, R]$, $R_\alpha > r$ whenever the set of relays whose rewards are more than $\alpha$ is nonempty and, further, the reward of the first relay to wake up from this set is more than $r$. Therefore, for $r \in (\alpha, R]$,

$$\mathbb{P}(R_\alpha > r) = \big(1 - F_R(\alpha)^N\big)\,\frac{1 - F_R(r)}{1 - F_R(\alpha)}. \tag{29}$$

Here $1 - F_R(\alpha)^N$ is the probability that there is at least one relay with a reward value of more than $\alpha$, and $\frac{1 - F_R(r)}{1 - F_R(\alpha)}$ is the probability that the reward of the first relay (to wake up from the set mentioned above) is more than $r$, conditioned on the fact that its reward is already more than $\alpha$.

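Using (28) and (29) in (27), it is possible to compute $\mathbb{E}[R_\alpha]$ numerically. We will use these expressions while describing a policy $\pi_{\text{A-SIMPL}}$ (in Section VII-B) which is derived from the simplified model. For $\alpha_1 \ge \alpha_2$ it is clear that $R_{\alpha_1} \ge R_{\alpha_2}$ on every realization (if some relay has a reward of at least $\alpha_1$, the $\alpha_2$-rule picks either that same relay or an earlier one whose reward is below $\alpha_1$; otherwise the $\alpha_1$-rule picks the best relay), which means that $\mathbb{E}[R_\alpha]$ is non-decreasing in $\alpha$.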
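A direct way to use (27) to (29) is to integrate the tail probability numerically and to compare the result with a Monte Carlo run of the $\alpha$-threshold rule. In the sketch below the cdf `cdf`, the number of relays `n` and the simulated reward law (uniform on $[0, 1]$) are illustrative assumptions; since the rewards are iid, the order of the simulated list plays the role of the wake-up order.

```python
import random

# Sketch: E[R_alpha] via (27)-(29), plus a Monte Carlo check of the alpha-rule.
# Assumptions (ours): `cdf` is F_R on [0, r_max] and n is the number of relays N.


def tail_prob(r, alpha, cdf, n):
    """P(R_alpha > r): equation (28) for r <= alpha, equation (29) for r > alpha."""
    if r <= alpha:
        return 1.0 - cdf(r) ** n
    f_alpha = cdf(alpha)
    if f_alpha >= 1.0:
        return 0.0
    return (1.0 - f_alpha ** n) * (1.0 - cdf(r)) / (1.0 - f_alpha)


def mean_reward(alpha, cdf, n, r_max, steps=4000):
    """E[R_alpha] = integral_0^{r_max} P(R_alpha > r) dr, equation (27), midpoint rule."""
    h = r_max / steps
    return h * sum(tail_prob((i + 0.5) * h, alpha, cdf, n) for i in range(steps))


def simulate_mean_reward(alpha, n, trials, rng):
    """Monte Carlo estimate of E[R_alpha] for iid Uniform[0, 1] rewards."""
    total = 0.0
    for _ in range(trials):
        rewards = [rng.random() for _ in range(n)]  # list order = wake-up order
        total += next((r for r in rewards if r >= alpha), max(rewards))
    return total / trials


if __name__ == "__main__":
    uniform_cdf = lambda r: min(max(r, 0.0), 1.0)
    rng = random.Random(1)
    print(mean_reward(0.55, uniform_cdf, 5, 1.0),
          simulate_mean_reward(0.55, 5, 20000, rng))
```

For $R$ uniform on $[0, 1]$, (27) to (29) give the closed form $\mathbb{E}[R_\alpha] = \alpha - \frac{\alpha^{N+1}}{N+1} + \frac{(1 - \alpha^N)(1 - \alpha)}{2}$, which is a convenient check on both functions.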

VII. Numerical and Simulation Results 

A. One Hop Performance 

Recall that our model admits any general reward associated with a relay. In this section we perform and discuss a simulation study of geographical forwarding in a dense sensor network with sleep-wake cycling nodes, where the reward provided by a relay is the progress made towards the base station (or sink) if the packet is forwarded to that relay. In Appendix IV we show simulation results for other rewards (e.g., a reward that is a function of the progress and the channel gain).




Fig. 8. The hatched region is the forwarding region.

The source and sink are separated by a distance of $d = 10$ (see Fig. 8). The source has a packet to forward at time 0. The communication radius of the source is $r_c = 1$. The potential relay nodes are the neighbors of the source that are closer to the sink than itself. The period of sleep-wake cycling is $T = 1$. Let $Z_i$ represent the progress of relay $i$, i.e., the difference between the source-sink and relay-sink distances. The reward associated with relay $i$ is simply its progress, i.e., $R_i = Z_i$. We use progress and reward interchangeably in this section.

Each node is located uniformly in the forwarding set, independently of the other nodes. Therefore, it can be shown that the progress values made by them are iid with pdf

$$f_Z(z) = \frac{2(d - z)\cos^{-1}\!\left(\dfrac{d^2 + (d - z)^2 - r_c^2}{2d(d - z)}\right)}{\text{Area of the forwarding region}}, \qquad z \in [0, r_c], \tag{30}$$

and the support of $f_Z$ is $[0, r_c]$. Hence $r_c$ is analogous to $R$ (see System Model, Section III). We take the bound on the number of relays as $K = 50$, and the initial pmf is taken as truncated Poisson with parameter 10, i.e., for $n = 1, 2, \cdots, K$, $p_0(n) = c\,\frac{10^n e^{-10}}{n!}$, where $c$ is the normalization constant. The above-mentioned reward pdf ($f_Z$) and initial belief ($p_0$) will be a good approximation if the nodes are deployed in a region according to a spatial Poisson process of rate 10. The approximation becomes better for larger values of $K$.
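The progress density (30) and the truncated Poisson prior $p_0$ can be evaluated as in the following Python sketch. It is illustrative only: it takes the area of the forwarding region in closed form as the lens-shaped intersection of the disc of radius $r_c$ around the source with the disc of radius $d$ around the sink, and checks numerically that $f_Z$ integrates to one; the helper names are ours.

```python
import math

# Sketch of the one-hop setting of Section VII-A (helper names are ours).
# Points at progress z lie on an arc of radius d - z centred at the sink; the
# arc inside the source's disc of radius r_c has half-angle acos(...) below.


def forwarding_area(d, r_c):
    """Area of {|x - source| <= r_c} intersected with {|x - sink| <= d}."""
    return (r_c ** 2 * math.acos(r_c / (2 * d))
            + d ** 2 * math.acos(1 - r_c ** 2 / (2 * d ** 2))
            - 0.5 * r_c * math.sqrt(4 * d ** 2 - r_c ** 2))


def progress_pdf(z, d, r_c):
    """f_Z(z) of (30): arc length at distance d - z from the sink, over the area."""
    if not 0.0 <= z <= r_c:
        return 0.0
    rho = d - z
    cos_arg = (d ** 2 + rho ** 2 - r_c ** 2) / (2 * d * rho)
    cos_arg = min(1.0, max(-1.0, cos_arg))  # guard against rounding outside [-1, 1]
    return 2 * rho * math.acos(cos_arg) / forwarding_area(d, r_c)


def truncated_poisson(mean, k_max):
    """p_0(n) proportional to mean^n e^{-mean} / n! for n = 1, ..., k_max."""
    weights = [mean ** n * math.exp(-mean) / math.factorial(n) for n in range(1, k_max + 1)]
    total = sum(weights)
    return {n: w / total for n, w in zip(range(1, k_max + 1), weights)}


if __name__ == "__main__":
    d, r_c, steps = 10.0, 1.0, 20000
    h = r_c / steps
    mass = h * sum(progress_pdf((i + 0.5) * h, d, r_c) for i in range(steps))
    p0 = truncated_poisson(10.0, 50)
    print(mass, sum(n * p for n, p in p0.items()))
```

Note that the mean of $p_0$ comes out slightly above 10, because the truncation removes the mass at $n = 0$.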

Since it is computationally intensive to obtain the thresholds $\{\phi_\ell\}$ in (11) inductively, we have discretized the space $[0, T] \times [0, R]$ into $100 \times 100$ equally spaced points and obtained $\{\phi_\ell\}$ at these points. Appropriate pmfs are obtained from the pdfs. All the analysis in the previous sections holds for this discrete setting as well.
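The text leaves open exactly how the pmfs are obtained from the pdfs. One simple possibility, sketched below as an assumption rather than as the authors' procedure, assigns to each of the equally spaced points the probability mass of its cell (computed with a midpoint rule) and then renormalizes; `pdf_to_pmf` is a hypothetical helper.

```python
# Hypothetical discretisation (not necessarily the authors'): each of
# `num_points` equally spaced points receives the probability of its cell.


def pdf_to_pmf(pdf, upper, num_points=100, sub_steps=20):
    """Return (points, pmf): cell midpoints on [0, upper] and normalised cell masses."""
    cell = upper / num_points
    h = cell / sub_steps
    points, masses = [], []
    for i in range(num_points):
        lo = i * cell
        points.append(lo + 0.5 * cell)
        # midpoint rule with `sub_steps` sub-intervals inside the cell
        masses.append(h * sum(pdf(lo + (k + 0.5) * h) for k in range(sub_steps)))
    total = sum(masses)
    return points, [m / total for m in masses]


if __name__ == "__main__":
    points, pmf = pdf_to_pmf(lambda x: 2.0 * x, 1.0)  # triangular pdf on [0, 1]
    print(sum(x * p for x, p in zip(points, pmf)))     # close to the mean 2/3
```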

When the actual state space is discrete, there are established algorithms to obtain the optimal policy for POMDP problems [22], [23], [24]. However, it is highly computationally intensive to apply these algorithms here because of the large state space. For instance, with $K = 50$, the cardinality of the actual state space is $50 \times 100 \times 100$. Hence we compare the performance of our suboptimal POMDP policies with the COMDP policy (Section IV), which is optimal when the actual number of relays is known and hence serves as a lower bound on the cost that can be achieved by the optimal POMDP policy.

1) Implemented Policies (one-hop): We summarize the various policies we have implemented. 

• $\pi_{\text{COMDP}}$: The source knows the actual value of $N$. Suppose $N = n$; then the source begins with an initial belief with mass only on $n$. At any stage $k = 1, 2, \cdots, n$, if the delay and best reward pair is $(w, b)$, then transmit if $b \ge \phi_{n-k}(w, b)$, and continue otherwise. See the remark following Lemma 3.

• $\pi_{\text{INNER}}$: We use the inner bound $\underline{\mathcal{C}}_k(w, b)$ to obtain a suboptimal policy. At stage $k$, if the belief state is $(p, w, b) \in \mathcal{S}_k$, then transmit iff $p \in \underline{\mathcal{C}}_k(w, b)$.

• $\pi_{\text{OUTER}}$: We use the outer bound $\overline{\mathcal{C}}_k(w, b)$ to obtain a suboptimal policy. At stage $k$, if the belief state is $(p, w, b) \in \mathcal{S}_k$, then transmit iff $p \in \overline{\mathcal{C}}_k(w, b)$.

• $\pi_{\text{A-COMDP}}$ (Average-COMDP): The source assumes that $N$ is equal to its average value $\bar{N} = \lceil \mathbb{E}N \rceil$ (where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$) and begins with an initial pmf with mass only on $\bar{N}$. Suppose $N = n$, which the source does not know; then at stage $k = 1, 2, \cdots, \min\{n, \bar{N}\}$, if the delay and best reward pair is $(w, b)$, transmit iff $b \ge \phi_{\bar{N}-k}(w, b)$. In the case $\bar{N} > n$, if the source has not transmitted until stage $n$ and at stage $n$ the action is again to continue, then, since there are no more relays to go, the source ends up waiting until time $T$ and then forwards to the node with the best reward.
• $\pi_{\text{A-SIMPL}}$ (Average-Simple): This policy is derived from the simplified model described in Section VI. The source considers the simplified model assuming that there are $\bar{N} = \lceil \mathbb{E}N \rceil$ relays, and computes the threshold $\alpha$ accordingly using (23). The policy is to transmit to the first relay that wakes up and offers a reward (progress, in this case) of more than $\alpha$. If there is no such relay, then the source ends up waiting until time $T$ and then transmits to the node with the best reward.
2) Discussion: We have performed simulations to obtain the average values for the above policies for several values of $\eta$ ranging from 0.1 to 1000. In Fig. 9(a) we plot the average delays of the policies described above as a function of $\eta$. The average reward is plotted in Fig. 9(b).

Fig. 9. (a) Average delay as a function of $\eta$. (b) Average reward as a function of $\eta$. In both panels $\eta$ is on a logarithmic axis from $10^{-1}$ to $10^{3}$.

As functions of $\eta$, both the average delay and the average reward are increasing. This is because for larger $\eta$ we value the progress more, so we tend to wait longer in order to do better in progress. For very small values of $\eta$, all the thresholds ($\{\phi_\ell\}$ and $\alpha$) are very small and most of the time the packet is forwarded to the first node (referred to as the First Forward policy in [7]). For very high values of $\eta$, the policies end up waiting for all the relays and then choose the one with the best reward (referred to as the Max Forward policy in [7]). Therefore, as $\eta$ increases, the average progress of all the policies (excluding $\pi_{\text{A-COMDP}}$) converges to $\mathbb{E}[\max\{Z_1, \cdots, Z_N\}]$, which is about 0.82 (see Fig. 9(b)). However, the average progress of $\pi_{\text{A-COMDP}}$ converges to a value less than 0.82. This is because whenever $\bar{N} < N$, for large $\eta$ (where all the thresholds $\{\phi_\ell\}$ are large) $\pi_{\text{A-COMDP}}$ ends up waiting for the first $\bar{N}$ relays and obtains a progress of $\max\{Z_1, \cdots, Z_{\bar{N}}\}$, which is less than (or equal to) the progress made by the other policies (which is $\max\{Z_1, \cdots, Z_N\}$).

Recall that the main problem we are interested in is the constrained one: minimize the average delay subject to the average reward being at least $\gamma$. We should therefore compare the average delays obtained by the above policies when the average reward provided by each of them is $\gamma$. This requires us, for each policy, to use an $\eta$ such that the average reward equals $\gamma$. Since we do not have a closed-form expression for the average reward in terms of $\eta$, we proceed as follows. We fix a target $\gamma$. For each policy, we choose, among the average reward values obtained for the several $\eta$ values, the one that is closest to the target $\gamma$ and consider the corresponding average delay. For different targets $\gamma$, Tables II and III list such average progress and delay values, respectively, for the different policies.
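The matching step just described amounts to an arg-min over a sweep of $\eta$ values. A minimal sketch follows; the tuple layout `(eta, avg_reward, avg_delay)` and the function name are our own assumptions, and the numbers in the usage lines are synthetic placeholders, not simulation results.

```python
# Sketch of the matching step: from a sweep over eta, pick the run whose
# average reward is closest to the target gamma and report its delay.
# The sweep values below are synthetic placeholders, not results.


def closest_to_target(sweep, gamma):
    """sweep: list of (eta, avg_reward, avg_delay) tuples from simulation."""
    return min(sweep, key=lambda row: abs(row[1] - gamma))


if __name__ == "__main__":
    synthetic_sweep = [(0.1, 0.50, 0.05), (1.0, 0.70, 0.20), (10.0, 0.79, 0.45)]
    eta, reward, delay = closest_to_target(synthetic_sweep, gamma=0.8)
    print(eta, reward, delay)  # picks the eta = 10.0 row
```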



TABLE II
For a given target $\gamma$ (a column) and a policy (a row), the entry in the table is the average progress value that is closest to the target $\gamma$.

Target $\gamma$                              0.6800   0.7200   0.7600   0.8000
$\mathbb{E}[R_{\pi_{\text{COMDP}}}]$         0.6840   0.7198   0.7612   0.8000
$\mathbb{E}[R_{\pi_{\text{INNER}}}]$         0.6822   0.7212   0.7600   0.8001
$\mathbb{E}[R_{\pi_{\text{OUTER}}}]$         0.6789   0.7208   0.7578   0.8003
$\mathbb{E}[R_{\pi_{\text{A-COMDP}}}]$       0.6773   0.7195   0.7590   0.8005
$\mathbb{E}[R_{\pi_{\text{A-SIMPL}}}]$       0.6819   0.7165   0.7585   0.7996



TABLE III
For a given target $\gamma$ (a column) and a policy (a row), the entry in the table is the average delay value corresponding to the average progress value in Table II.

Target $\gamma$                              0.6800   0.7200   0.7600   0.8000
$\mathbb{E}[D_{\pi_{\text{COMDP}}}]$         0.2262   0.2711   0.3529   0.5012
$\mathbb{E}[D_{\pi_{\text{INNER}}}]$         0.2343   0.2905   0.3735   0.5450
$\mathbb{E}[D_{\pi_{\text{OUTER}}}]$         0.2359   0.2967   0.3756   0.5551
$\mathbb{E}[D_{\pi_{\text{A-COMDP}}}]$       0.2336   0.2954   0.3825   0.5997
$\mathbb{E}[D_{\pi_{\text{A-SIMPL}}}]$       0.2338   0.2823   0.3684   0.5415



The entries in the first row of both tables are the different values of the target $\gamma$ (namely, 0.68, 0.72, 0.76 and 0.8). We discuss the entries in the last column (i.e., the entries corresponding to the target $\gamma = 0.8$). Reading the values in the last column of Table II, which contains the average progress values, we see that the average progress of every policy is within $\pm 0.0005$ of 0.8 (for the other columns, all the entries are within $\pm 0.005$ of the corresponding target $\gamma$). Hence it is reasonable to compare the delay values of the various policies in the last column of Table III. As expected, $\pi_{\text{COMDP}}$ obtains the lowest delay (0.5012). There is only a very small performance gap between the INNER and OUTER bound policies, i.e., the delay obtained by the INNER bound policy (0.5450) is slightly less than that of the OUTER bound policy (0.5551). The scheme A-COMDP, which simply assumes that the actual number of relays is the average of the initial belief, results in a higher delay (0.5997). Interestingly, the policy A-SIMPL, which was derived from the simplified model, performs very close to the INNER bound policy (with an average delay of 0.5415). Other columns can be read similarly. For small values of the target progress $\gamma$, we see similar performance for all the policies. These observations are for the particular case where the reward is simply the progress and the initial belief is truncated Poisson. In Appendix IV we show simulation results for other reward structures and initial beliefs; we observe similar behavior there as well.
B. End-to-End Performance

The single hop problem considered by us was originally motivated by the end-to-end problem. In the geographical forwarding context, the end-to-end metrics of interest are the total delay and the hop count. Hop count is important because it is proportional to the number of transmissions and hence to the energy expended by the network. Each of these metrics immediately suggests one of two extreme policies. One policy is for each node to transmit to the first neighbor in its forwarding set to wake up. The second policy is to wait for all the neighbors in the forwarding set to wake up and then transmit to the one that makes the maximum progress towards the sink. It is reasonable to expect that the first policy will minimize the end-to-end delay, while the second one will result in the least hop count. Hence there is a tradeoff between the two metrics. Suppose we want to minimize the average total end-to-end delay subject to an average hop count constraint of $h$. Let $d$ be the distance between the source and the sink. Heuristically, we expect that the hop count constraint would be (approximately) met if each node en route to the sink contributes an average progress of $\frac{d}{h}$. For this average progress constraint, if each node now uses the locally optimal policy ($\pi_{\text{COMDP}}$), we expect the average delay at each hop to be minimized and, hence, obtain a close to optimal average total delay. Instead of the optimal policy, each node can use the policy $\pi_{\text{A-SIMPL}}$, since its one hop performance is close to the optimum. Also, its application only requires a node $i$ to compute a simple threshold $\alpha_i$, unlike the other policies, where the computation of the thresholds $\{\phi_\ell\}$ is intensive. Fig. 10 illustrates the multihop forwarding algorithm with each node using the locally derived threshold (obtained from the simplified model in Section VI) to forward. Next we briefly describe the network setting and the implemented policies.











Fig. 10. Each node en route to the sink uses the threshold obtained from the simplified model; $\alpha_i$ and $\alpha_j$ are the thresholds used by nodes $i$ and $j$, respectively.

1) Network Setting: First we fix a network by placing $M$ nodes randomly in $[0, L]^2$, where $L = 10$. $M$ is sampled from $\text{Poisson}(\lambda L^2)$, where $\lambda = 5$. Additional source and sink nodes are placed at the locations $(0, 0)$ and $(L, L)$, respectively. Further, we have considered a network realization where the forwarding set of each node is nonempty. The wake-up times of the nodes are sampled independently from $\text{Uniform}([0, T])$ with $T = 1$. If the wake-up instant of a node $i$ is $T_i$, then it wakes up at the periodic instants $\{kT + T_i : k \ge 0\}$. The communication radius of each node is $r_c = 1$. The source is given a packet at time 0, and we are interested in routing this packet to the sink.
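The network generation just described can be sketched in Python as follows. This is illustrative only: we read "placing $M$ nodes randomly" as uniform placement, sample the Poisson count by counting unit-rate exponential inter-arrivals, and all helper names are hypothetical (`math.dist` needs Python 3.8 or later).

```python
import math
import random

# Illustrative sketch of the end-to-end network setting (names are ours).


def poisson_sample(mean, rng):
    """Poisson(mean) sample: count unit-rate exponential arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def make_network(rng, lam=5.0, side=10.0, period=1.0):
    """Source at (0, 0), M ~ Poisson(lam * side^2) uniform nodes, sink at (side, side)."""
    m = poisson_sample(lam * side * side, rng)
    nodes = [(0.0, 0.0)]
    nodes += [(rng.uniform(0.0, side), rng.uniform(0.0, side)) for _ in range(m)]
    nodes.append((side, side))
    phases = [rng.uniform(0.0, period) for _ in nodes]  # T_i ~ Uniform[0, T]
    return nodes, phases


def next_wakeup(phase, t, period):
    """Earliest wake-up instant k*T + T_i (k >= 0) that is not before time t."""
    if t <= phase:
        return phase
    return math.ceil((t - phase) / period) * period + phase


def forwarding_set(i, nodes, r_c):
    """Indices j within distance r_c of node i that are strictly closer to the sink."""
    sink = nodes[-1]
    d_i = math.dist(nodes[i], sink)
    return [j for j, x in enumerate(nodes)
            if j != i and math.dist(nodes[i], x) <= r_c and math.dist(x, sink) < d_i]


def sample_valid_network(rng, r_c=1.0, max_tries=100):
    """Resample until every node except the sink has a nonempty forwarding set."""
    for _ in range(max_tries):
        nodes, phases = make_network(rng)
        if all(forwarding_set(i, nodes, r_c) for i in range(len(nodes) - 1)):
            return nodes, phases
    return None
```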
2) Implemented Policies (end-to-end): We also compare our work with that of Kim et al. [13], who have developed end-to-end delay optimal geographical forwarding in a network setting similar to ours. We first give a brief description of their work. They minimize, for a given network, the average delay from any node to the sink when each node $i$ wakes up asynchronously with rate $r_i$. They show that periodic wake-up patterns obtain the minimum delay among all sleep-wake patterns with the same rate. A relay node with a packet to forward transmits a sequence of beacon-ID signals. They propose an algorithm called LOCAL-OPT [13] which yields, for each neighbor $j$ of node $i$, an integer $h_j^{(i)}$ such that if $j$ wakes up and listens to the $h$-th beacon signal from node $i$ and $h \le h_j^{(i)}$, then $j$ sends an ACK to receive the packet from $i$. Otherwise (if $h > h_j^{(i)}$), $j$ goes back to sleep and $i$ continues waiting for further neighbors to wake up. A configuration phase is required to run the LOCAL-OPT algorithm.

To make a fair comparison with the work of Kim et al. in our network setting, we also introduce beacon-ID signals of duration $t_I = 5$ msec and a packet transmission duration of 30 msec. A description of the policies we have implemented is given below.

• $\pi_{FF}$ (First Forward): Whenever a node gets a packet, it always transmits to the first neighbor in its forwarding set to wake up, irrespective of the progress made by that neighbor.

• $\pi_{MF}$ (Max Forward): We assume that each node knows the number of neighbors in its forwarding set. In this policy a node, when it gets a packet, waits for all of its neighbors in the forwarding set to wake up. Finally, when the last neighbor wakes up, it forwards the packet to the one which achieves the maximum progress towards the sink.

• $\pi_{SF}$ (Simplified Forward): This end-to-end policy works by applying the $\pi_{\text{A-SIMPL}}$ policy at each hop. First we fix $\gamma$ as a network parameter (as mentioned before, $\gamma$ can be set to $\frac{d}{h}$). Nodes do not know the number of neighbors in their forwarding set. However, they know the node density and thus estimate this number as $\lceil \lambda \times \text{forwarding set area} \rceil$. Using this estimated number, a node considers the simplified model and comes up with a threshold $\alpha$ such that the average progress $\mathbb{E}[R_\alpha]$ in (27) is equal to $\gamma$ (see also (28) and (29)). $\mathbb{E}[R_\alpha]$ is non-decreasing in $\alpha$. Hence, for some node $i$, if $\gamma \le \mathbb{E}[R_0]$ then node $i$ chooses its threshold as 0, and if $\gamma \ge \mathbb{E}[R_{r_c}]$ then node $i$ uses $r_c$ as its threshold. Suppose node $i$ has a packet to forward. When a neighbor of node $i$, say node $j$, wakes up and hears a beacon signal from $i$, it waits for the ID signal and then sends an ACK signal containing its location information. If the progress made by $j$ is more than the threshold, then $i$ forwards the packet to $j$ (the packet duration is 30 msec). If the progress made by $j$ is less than the threshold, then $i$ asks $j$ to stay awake if its progress is the maximum among all the nodes that have woken up thus far; otherwise $i$ asks $j$ to return to sleep. If more than one node wakes up during the same beacon signal, contentions are resolved by selecting the one which makes the most progress among them. In the simulation this happens instantly (as also for the Kim et al. algorithm that we compare with); in practice this will require a splitting algorithm; see, for example, [30, Chapter 4.3]. We assume that all these transactions (beacon signal, ID, ACK and contention resolution, if any) are over within $t_I = 5$ msec. If there is no eligible node even after the $\frac{T}{t_I}$-th beacon signal (one case when this is possible is when the actual number of nodes $N$ is less than $\lceil \lambda \times \text{forwarding set area} \rceil$ and none of the nodes makes a progress of more than the threshold), then $i$ selects the one which makes the maximum progress among all nodes.

• $\pi_{SF'}$: This is the same as $\pi_{SF}$, but here we assume that each node knows the exact number of neighbors in its forwarding set and uses this exact number to come up with the threshold $\alpha$. Unlike in the previous case, here, if none of the neighbors of node $i$ makes a progress of more than the threshold used by $i$, then, knowing the number of neighbors, node $i$ chooses the neighbor with the best progress when the last one wakes up. $\pi_{FF}$ and $\pi_{MF}$ can be thought of as special cases of $\pi_{SF'}$ with thresholds of 0 and $r_c$, respectively.

• Kim et al.: We run the LOCAL-OPT algorithm [13] on the network and obtain the values $h_j^{(i)}$ for each pair $(i, j)$ where $i$ and $j$ are neighbors. We use these values to route from source to sink in the presence of sleep-wake cycling. Contentions, if any, are resolved (instantly, in the simulation) by selecting a node $j$ with the highest $h_j^{(i)}$ index.

Fig. 11. End-to-end performance: plot of average end-to-end delay vs. average end-to-end hop count obtained by applying the simple $\alpha$-threshold policy, $\pi_{\text{A-SIMPL}}$, at each hop. The operating points of the policies $\pi_{FF}$, $\pi_{MF}$ and Kim et al. are also shown in the figure. Each point on the curve corresponds to a different value of $\gamma$, which increases along the direction shown.
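The threshold computation in $\pi_{SF}$ reduces to inverting the non-decreasing map $\alpha \mapsto \mathbb{E}[R_\alpha]$, together with the clamping rule stated in the $\pi_{SF}$ description. The sketch below does this by bisection. It assumes a user-supplied function `mean_reward(alpha)` that evaluates (27) to (29) for the estimated number of neighbors; for the usage lines only, it replaces $f_Z$ by a uniform progress on $[0, 1]$, for which $\mathbb{E}[R_\alpha]$ has a closed form, and the forwarding set area is a hypothetical value.

```python
import math

# Sketch (our assumptions): `mean_reward(alpha)` evaluates E[R_alpha] from
# (27)-(29) for the node's estimated number of neighbours and is
# non-decreasing in alpha.


def estimated_neighbours(lam, area):
    """Estimated forwarding-set size, ceil(lambda * forwarding set area)."""
    return math.ceil(lam * area)


def threshold_for_target(gamma, mean_reward, r_c, tol=1e-9):
    """Threshold alpha in [0, r_c] with E[R_alpha] = gamma, clamped at the ends."""
    if gamma <= mean_reward(0.0):
        return 0.0     # the first relay already meets the target on average
    if gamma >= mean_reward(r_c):
        return r_c     # target unreachable: effectively wait for the best relay
    lo, hi = 0.0, r_c  # invariant: mean_reward(lo) < gamma <= mean_reward(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_reward(mid) < gamma:
            lo = mid
        else:
            hi = mid
    return hi


def uniform_mean_reward(alpha, n):
    """E[R_alpha] from (27)-(29) when the progress is Uniform[0, 1] (illustrative)."""
    return alpha - alpha ** (n + 1) / (n + 1) + (1 - alpha ** n) * (1 - alpha) / 2


if __name__ == "__main__":
    n = estimated_neighbours(5.0, 1.5)  # hypothetical area; gives n = 8
    alpha = threshold_for_target(0.7, lambda a: uniform_mean_reward(a, n), 1.0)
    print(n, round(alpha, 4))
```

Bisection is a natural choice here because monotonicity of $\mathbb{E}[R_\alpha]$ in $\alpha$ is all that is required; no derivative of (27) is needed.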

3) Discussion: In Fig. 11 we plot the average total delay vs. the average hop count for different policies for a fixed node placement, while the averaging is over the wake-up times of the nodes. Each point on the curve is obtained by averaging over 1000 transfers of the packet from the source node to the sink. As expected, Kim et al. achieves the minimum average delay. In comparison with $\pi_{FF}$, Kim et al. also achieves a smaller average hop count. Notice, however, that using the $\pi_{SF}$ (or $\pi_{SF'}$) policy and properly choosing $\gamma$, it is possible to obtain a hop count similar to that of Kim et al., incurring only a slightly higher delay.

The advantage of $\pi_{SF}$ over Kim et al. is that there is no need for a configuration phase. Each relay node only has to compute a threshold that depends on the parameter $\gamma$, which can be set as a network parameter during deployment. A more interesting approach would be to allow the source node to set $\gamma$ depending on the type of application. For delay-sensitive applications it is appropriate to use a smaller value of $\gamma$ so that the delay is small, whereas for energy-constrained applications (where the network energy needs to be conserved) it is better to use a large $\gamma$ so that the number of hops (and hence the number of transmissions) is reduced. For other applications, moderate values of $\gamma$ can be used. $\gamma$ can be a part of the ID signal so that it is made available to the next-hop relay.

Another interesting observation from Fig. 11 is that the performance of $\pi_{SF}$ is close to that of $\pi_{SF'}$. In practice, it may not be possible for a node to know the exact number of relays in its forwarding set, due to varying channel conditions, node failures, etc. Recall that $\pi_{SF}$ works with the average number of nodes instead of the actual number. For small values of $\gamma$, both policies $\pi_{SF}$ and $\pi_{SF'}$, most of the time, transmit to the first node to wake up; hence the performance is similar for small $\gamma$. For large $\gamma$, we observe that the delay incurred by $\pi_{SF}$ is larger

VIII. Conclusion 

Our work in this paper was motivated by the problem of geographical forwarding of packets in a wireless sensor network whose function is to detect certain infrequent events and forward these alarms to a base station, and whose nodes are sleep-wake cycling to conserve energy. This end-to-end problem gave rise to the local problem faced by a packet forwarding node, i.e., that of choosing one among a set of potential relays so as to minimize the average delay in selecting a relay, subject to a constraint on the average progress (or some reward, in general). The source does not know the number of available relays, which makes this a sequential decision problem with partial information. We formulated the problem as a finite horizon POMDP with the unknown state being the number of available relays. The optimum stopping set is the set of all pmfs on the number of relays for which the average cost of stopping is less than that of continuing. We showed that the optimum stopping set is convex (Corollary 1) and obtained threshold points along certain edges of the simplex which belong to the optimum stopping set. A convex combination of these points gave us an inner bound for the optimum stopping set (Theorem 1). We proved a monotonicity result and obtained an outer bound (Theorem 2). We also obtained a simple threshold rule by formulating an alternative simplified model (Section VI).

We have performed simulations to compare the performance of the various policies. We observe that the inner bound policy ($\pi_{\text{INNER}}$) performs better than the outer bound policy ($\pi_{\text{OUTER}}$). Further, the performance of the simple threshold policy ($\pi_{\text{A-SIMPL}}$) is comparable with that of $\pi_{\text{INNER}}$, and both are close to the optimal policy ($\pi_{\text{COMDP}}$). We have performed one-hop simulations for a few other examples with different rewards and initial beliefs (see Appendix IV). In all the examples, we observe good performance of the policy $\pi_{\text{A-SIMPL}}$.

We have devised simple end-to-end policies ($\pi_{SF}$ and $\pi_{SF'}$) using $\pi_{\text{A-SIMPL}}$. We have shown that, by varying a network parameter, these policies can favourably trade off between the average total delay and the average hop count.
Appendix I
Proofs of Lemmas in Section IV

A. Proof of Lemma 2

From (11) (the subscripts of both $U$ and $R$ in the following expressions are $K - \ell + 1$, which we have suppressed for simplicity),

$$\begin{aligned} \phi_\ell(w, b) &= \mathbb{E}_{K-\ell}\Big[\max\big\{b, R, \phi_{\ell-1}\big(w + U, \max\{b, R\}\big)\big\} - \frac{U}{\eta} \,\Big|\, w, K\Big] \\ &\ge \mathbb{E}_{K-\ell}\Big[\max\big\{b, R, \phi_{\ell-1}\big(w + U, \max\{b, R\}\big)\big\} \,\Big|\, w, K\Big] - \frac{T - w}{\eta} \\ &\ge b - \frac{T - w}{\eta}, \end{aligned}$$

where the first inequality follows from $U \le T - w$ and the second is due to the max inside the expectation. ■
B. Proof of Lemma 3

We proceed by value iteration. First we show that the lemma holds for $k = n$, where the fixed $n$ could be either less than $K$ or equal to $K$ (recall that $K$ is the bound on the number of relays). Suppose $n < K$. Since $p_n(n) = 1$, it follows from (15) that $c_n(p_n, w, b) = T - w - \eta b$. Therefore,

$$J_n(p_n, w, b) = \min\big\{-\eta b,\ T - w - \eta b\big\} = -\eta b.$$

If $n = K$, then $J_n(p_n, w, b) = g_K(p_n, w, b) = -\eta b$. Thus for any fixed $n$ we can write

$$J_n(p_n, w, b) = \min\big\{-\eta b,\ -\eta\phi_{n-n}(w, b)\big\}.$$

Suppose for some $k = 1, \cdots, n-1$ the following holds:

$$J_{k+1}(p_{k+1}, w, b) = \min\big\{-\eta b,\ -\eta\phi_{n-(k+1)}(w, b)\big\}.$$

Then,

$$\begin{aligned} c_k(p_k, w, b) &= \mathbb{E}_k\Big[U_{k+1} + J_{k+1}\big(\mathcal{T}_{k+1}(p_k, w, U_{k+1}),\, w + U_{k+1},\, \max\{b, R_{k+1}\}\big) \,\Big|\, w, n\Big] \\ &= \mathbb{E}_k\Big[U_{k+1} + \min\big\{-\eta\max\{b, R_{k+1}\},\ -\eta\phi_{n-(k+1)}\big(w + U_{k+1}, \max\{b, R_{k+1}\}\big)\big\} \,\Big|\, w, n\Big] \\ &= -\eta\,\mathbb{E}_k\Big[\max\big\{b, R_{k+1}, \phi_{n-(k+1)}\big(w + U_{k+1}, \max\{b, R_{k+1}\}\big)\big\} - \frac{U_{k+1}}{\eta} \,\Big|\, w, n\Big]. \end{aligned} \tag{31}$$

In the second equality we have used the induction hypothesis and the fact that if $p_k(n) = 1$ then $\mathcal{T}_{k+1}(p_k, w, U_{k+1})(n) = 1$. The expectation in (31) is over the pdf $f_R(\cdot) f_k(\cdot \mid w, n)$. From its definition, the pdf $f_k(\cdot \mid w, n)$ depends on $k$ and $n$ only through the difference $n - k$. Therefore $f_k(\cdot \mid w, n) = f_{K-(n-k)}(\cdot \mid w, K)$. Using this and (11) in (31), we can write

$$c_k(p_k, w, b) = -\eta\,\mathbb{E}_{K-(n-k)}\Big[\max\big\{b, R, \phi_{n-(k+1)}\big(w + U, \max\{b, R\}\big)\big\} - \frac{U}{\eta} \,\Big|\, w, K\Big] = -\eta\phi_{n-k}(w, b).$$

Finally, using (16), we can write $J_k(p_k, w, b) = \min\{-\eta b, -\eta\phi_{n-k}(w, b)\}$. Hence we have proved that the lemma holds for $k$ if it holds for $k+1$. Since we have already shown that it holds for $n$, by induction it holds for all $k = 1, 2, \cdots, n$. ■

Appendix II 
Proof of Lemmas in SectionIv] 

A. Proof of Lemma |4] 

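The essence of the proof is the same as that of [25, Lemma 1]; we provide the proof here for completeness. From the definition of the cost of continuing, the cost to continue at stage $K-1$ is

$$
\begin{aligned}
C_{K-1}(p,w,b) &= p(K-1)\,(T - w - \eta b) + p(K)\,\mathbb{E}_{K-1}\big[U_K - \eta\max\{b,R_K\} \,\big|\, w, K\big]\\
&= p(K-1)\,(T - w - \eta b) + p(K)\,\big(-\eta\,\phi_1(w,b)\big).
\end{aligned}\tag{32}
$$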

Thus $C_{K-1}(\cdot,w,b)$ is an affine function of $p \in \mathcal{P}_{K-1}$ for every $(w,b)$. Since $J_{K-1}(\cdot,w,b)$ is the minimum of the two affine functions $-\eta b$ and $C_{K-1}(\cdot,w,b)$, it is concave on $\mathcal{P}_{K-1}$. The proof now proceeds by induction.
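As a quick numerical illustration of this step (a sketch with made-up values of $T$, $w$, $\eta$, $b$ and $\phi_1(w,b)$, not taken from the paper), the minimum of the two linear functions of $p$ with coefficient vectors $(T-w-\eta b,\ -\eta\phi_1(w,b))$ and $(-\eta b,\ -\eta b)$ from (32) satisfies the concavity inequality on randomly drawn pmfs. The helper names `min_affine` and `concave_on_samples` are hypothetical.

```python
import numpy as np

def min_affine(alphas, p):
    """J(p) = min over alpha of <alpha, p>, a minimum of linear functions of p."""
    return min(float(np.dot(a, p)) for a in alphas)

def concave_on_samples(alphas, trials=2000, seed=0):
    """Check J(lam*p + (1-lam)*q) >= lam*J(p) + (1-lam)*J(q) on random pmfs p, q."""
    rng = np.random.default_rng(seed)
    dim = len(alphas[0])
    for _ in range(trials):
        p, q = rng.dirichlet(np.ones(dim)), rng.dirichlet(np.ones(dim))
        lam = rng.uniform()
        lhs = min_affine(alphas, lam * p + (1 - lam) * q)
        rhs = lam * min_affine(alphas, p) + (1 - lam) * min_affine(alphas, q)
        if lhs < rhs - 1e-12:
            return False
    return True

# Made-up values of T, w, eta, b and phi_1(w, b); components are indexed by n = K-1, K.
T, w, eta, b, phi1 = 1.0, 0.2, 2.0, 0.5, 0.55
alphas = [np.array([T - w - eta * b, -eta * phi1]),  # continue: C_{K-1}(p, w, b) as in (32)
          np.array([-eta * b, -eta * b])]            # stop: -eta*b written as <(-eta b, -eta b), p>
print(concave_on_samples(alphas))
```

Writing the stopping cost as an inner product with a constant vector works because every $p$ sums to 1, which is also why no additive constants are needed in the induction below.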

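Induction hypothesis: for some $k = 1, 2, \dots, K-2$ and for each $(w,b)$, $J_{k+1}(\cdot,w,b)$ is concave on $\mathcal{P}_{k+1}$ and can be written as

$$
J_{k+1}(p,w,b) = \inf_{\alpha \in \mathcal{A}_{k+1}(w,b)} \langle \alpha, p\rangle = \big\langle \alpha_{k+1}^{p,w,b}, p\big\rangle, \tag{33}
$$

where $\mathcal{A}_{k+1}(w,b)$ is some collection of vectors of length $K-k$ and $\alpha_{k+1}^{p,w,b} = \arg\min_{\alpha \in \mathcal{A}_{k+1}(w,b)} \langle \alpha, p\rangle$.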

There are two points to note here. First, in general a concave function can be written as an infimum over some collection of affine functions of the form $\langle \alpha, p\rangle + c$, where $c$ is some constant; however, we claim that there are no such constants associated with the vectors $\alpha$ in the set $\mathcal{A}_{k+1}(w,b)$. Second, we claim the existence of the minimizing vector $\alpha_{k+1}^{p,w,b}$. Both claims are true for stage $K-1$, since the set $\mathcal{A}_{K-1}(w,b)$ comprises only two vectors, $\big(T - w - \eta b,\ -\eta\,\phi_1(w,b)\big)$ and $(-\eta b,\ -\eta b)$; i.e., the induction hypothesis holds for $k+1 = K-1$.
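To show that $J_k(\cdot, w, b)$ is concave on $\mathcal{P}_k$, it suffices to prove that $C_k(\cdot, w, b)$ is concave. $C_k$ can be written as

$$
C_k(p,w,b) = p(k)\,(T - w - \eta b) + \sum_{n=k+1}^{K} p(n)\,\mathbb{E}_k\big[U_{k+1} \,\big|\, w, n\big] + \sum_{n=k+1}^{K} p(n)\,\mathbb{E}_k\Big[J_{k+1}\big(\mathcal{T}_{k+1}(p,w,U_{k+1}),\, w + U_{k+1},\, \max\{b,R_{k+1}\}\big) \,\Big|\, w, n\Big]. \tag{34}
$$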

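Let us focus on the third term in the above sum; call it $S_3$ for convenience. Then

$$
\begin{aligned}
S_3 &= \sum_{n=k+1}^{K} p(n) \int_0^{\bar R}\!\!\int_0^{T-w} f_R(r)\, f_k(u|w,n)\, J_{k+1}\big(\mathcal{T}_{k+1}(p,w,u),\, w+u,\, \max\{b,r\}\big)\, du\, dr\\
&= \sum_{n=k+1}^{K} p(n) \int_0^{\bar R}\!\!\int_0^{T-w} f_R(r)\, f_k(u|w,n)\, \Big\langle \alpha_{k+1}^{\mathcal{T}_{k+1}(p,w,u),\, w+u,\, \max\{b,r\}},\ \mathcal{T}_{k+1}(p,w,u)\Big\rangle\, du\, dr,
\end{aligned}\tag{35}
$$

where the second equality uses (33). Substituting for $\mathcal{T}_{k+1}(p,w,u)$ from the belief-update equation and simplifying yields

$$
S_3 = \sum_{n=k+1}^{K} p(n) \int_0^{\bar R}\!\!\int_0^{T-w} f_R(r)\, f_k(u|w,n)\, \alpha_{k+1}^{\mathcal{T}_{k+1}(p,w,u),\, w+u,\, \max\{b,r\}}(n)\, du\, dr. \tag{36}
$$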



Define the vector $\alpha_k^{p,w,b}$ of length $K-k+1$ by $\alpha_k^{p,w,b}(k) = T - w - \eta b$ and, for $n = k+1, \dots, K$,

$$
\alpha_k^{p,w,b}(n) = \mathbb{E}_k\big[U_{k+1} \,\big|\, w, n\big] + \int_0^{\bar R}\!\!\int_0^{T-w} f_R(r)\, f_k(u|w,n)\, \alpha_{k+1}^{\mathcal{T}_{k+1}(p,w,u),\, w+u,\, \max\{b,r\}}(n)\, du\, dr.
$$

Then (34) can be written as

$$
C_k(p,w,b) = \big\langle \alpha_k^{p,w,b}, p\big\rangle.
$$



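Now, for any $q \in \mathcal{P}_k$, the expression $\langle \alpha_k^{q,w,b}, p\rangle$ contains a term similar to $S_3$ (see (35) and (36)), but with $\alpha_{k+1}^{\mathcal{T}_{k+1}(p,w,u),\, w+u,\, \max\{b,r\}}$ replaced by $\alpha_{k+1}^{\mathcal{T}_{k+1}(q,w,u),\, w+u,\, \max\{b,r\}}$; call this term $\tilde S_3$. More precisely, $\langle \alpha_k^{q,w,b}, p\rangle$ equals the right-hand side of (34) with its third term ($S_3$) replaced by $\tilde S_3$. Using (33) in (35), we observe that $\tilde S_3 \ge S_3$, since for each $(u,r)$ the vector $\alpha_{k+1}^{\mathcal{T}_{k+1}(p,w,u),\, w+u,\, \max\{b,r\}}$ minimizes $\langle \alpha, \mathcal{T}_{k+1}(p,w,u)\rangle$ over $\mathcal{A}_{k+1}(w+u, \max\{b,r\})$. Hence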

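$$C_k(p,w,b) \le \big\langle \alpha_k^{q,w,b}, p\big\rangle.$$

Hence, defining $\mathcal{A}_k(w,b) := \{\alpha_k^{q,w,b} : q \in \mathcal{P}_k\}$, we can write

$$C_k(p,w,b) = \inf_{\alpha \in \mathcal{A}_k(w,b)} \langle \alpha, p\rangle,$$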

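which proves that $C_k(\cdot, w, b)$ is concave (the infimum is attained at $\alpha_k^{p,w,b}$). Finally, by including in the set $\mathcal{A}_k(w,b)$ the vector of length $K-k+1$ with each component equal to $-\eta b$, we can express $J_k(p,w,b)$ as

$$J_k(p,w,b) = \inf_{\alpha \in \mathcal{A}_k(w,b)} \langle \alpha, p\rangle.$$

■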
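The step from (35) to (36) rests on the normalizing denominator of the belief update cancelling. The sketch below checks that identity numerically at one fixed pair $(u, r)$ (the common factor $f_R(r)$ is omitted), assuming for illustration that $\mathcal{T}_{k+1}(p,w,u)$ is the Bayes posterior proportional to $p(n) f_k(u|w,n)$. The likelihood values and the vector $\alpha$ are random stand-ins, and `bayes_update` and `lhs_rhs` are hypothetical names.

```python
import numpy as np

def bayes_update(p, lik):
    """Posterior weights proportional to p(n) * lik(n): the assumed form of T_{k+1}(p, w, u)."""
    post = p * lik
    return post / post.sum()

def lhs_rhs(p, lik, alpha):
    """Both sides of the per-(u, r) identity behind the step from (35) to (36).

    lhs: sum_n p(n) f(u|n) * <alpha, T(p)>   (form appearing in (35))
    rhs: sum_n p(n) f(u|n) * alpha(n)        (form appearing in (36))
    """
    lhs = np.sum(p * lik) * np.dot(alpha, bayes_update(p, lik))
    rhs = np.sum(p * lik * alpha)
    return float(lhs), float(rhs)

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(5))    # belief weights on n = k+1, ..., K (here K - k = 5)
lik = rng.uniform(0.1, 3.0, 5)   # stand-in for f_k(u | w, n) at one fixed u
alpha = rng.normal(size=5)       # stand-in for the alpha_{k+1} vector chosen at the updated belief
print(lhs_rhs(p, lik, alpha))
```

Because the identity holds pointwise in $(u, r)$, integrating both sides against $f_R(r)$ over $u$ and $r$ preserves it.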
B. Proof of Lemma 6

Since $q$ has mass only on $k$ and $k+1$, using Lemma 5 we can write

$$C_k(q,w,b) = q(k)\,(T - w - \eta b) + q(k+1)\,\big(-\eta\,\phi_1(w,b)\big).$$



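From the definition of $\phi_1$, we obtain $\phi_1(w,b) = \mathbb{E}\big[\max\{b,R\}\big] - \frac{T-w}{2\eta}$. Substituting for $\phi_1(w,b)$ in the above expression, we have

$$C_k(q,w,b) = q(k)\,(T - w - \eta b) + q(k+1)\left(\frac{T-w}{2} - \eta\,\mathbb{E}\big[\max\{b,R\}\big]\right).$$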



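Recall that

$$C_k(p,w,b) = p(k)\,(T - w - \eta b) + \sum_{n=k+1}^{K} p(n)\,\mathbb{E}_k\Big[U_{k+1} + J_{k+1}\big(\mathcal{T}_{k+1}(p,w,U_{k+1}),\, w + U_{k+1},\, \max\{b,R_{k+1}\}\big) \,\Big|\, w, n\Big].$$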

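Using the fact that $J_{k+1}$ is at most the cost of stopping, i.e., $J_{k+1}\big(\cdot,\, \cdot,\, \max\{b,R_{k+1}\}\big) \le -\eta\max\{b,R_{k+1}\}$, we can write

$$
\begin{aligned}
C_k(p,w,b) &\le p(k)\,(T - w - \eta b) + \sum_{n=k+1}^{K} p(n)\,\mathbb{E}_k\big[U_{k+1} - \eta\max\{b,R_{k+1}\} \,\big|\, w, n\big]\\
&= p(k)\,(T - w - \eta b) + \sum_{n=k+1}^{K} p(n)\left(\frac{T-w}{n-k+1} - \eta\,\mathbb{E}\big[\max\{b,R\}\big]\right)\\
&\le p(k)\,(T - w - \eta b) + \sum_{n=k+1}^{K} p(n)\left(\frac{T-w}{2} - \eta\,\mathbb{E}\big[\max\{b,R\}\big]\right)\\
&= p(k)\,(T - w - \eta b) + \big(1 - p(k)\big)\left(\frac{T-w}{2} - \eta\,\mathbb{E}\big[\max\{b,R\}\big]\right)\\
&= C_k(q,w,b),
\end{aligned}\tag{39}
$$

where the second inequality holds because $n - k + 1 \ge 2$ for $n \ge k+1$, and the last equality uses $q(k) = p(k)$ and $q(k+1) = 1 - p(k)$. ■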
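The step $\mathbb{E}_k[U_{k+1}\,|\,w,n] = (T-w)/(n-k+1)$ used above can be checked by simulation if one assumes, for illustration, that the $m = n-k$ relays still asleep wake up independently and uniformly on $[w, T]$, so that $U_{k+1}$ is the minimum of those $m$ wake-up times minus $w$. The sketch below (hypothetical helper `mean_next_wakeup`, made-up $T$ and $w$) compares the Monte Carlo mean with $(T-w)/(m+1)$, which never exceeds $(T-w)/2$.

```python
import numpy as np

def mean_next_wakeup(T, w, m, trials=200_000, seed=0):
    """Monte Carlo estimate of E[U] when m relays wake up i.i.d. uniformly on [w, T] (assumption)."""
    rng = np.random.default_rng(seed)
    wake = rng.uniform(w, T, size=(trials, m))
    return float((wake.min(axis=1) - w).mean())

T, w = 1.0, 0.3        # illustrative values
for m in (1, 2, 5):    # m = n - k relays still to wake up
    print(m, mean_next_wakeup(T, w, m), (T - w) / (m + 1))
```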



Appendix III 
Proof of Lemmas in Section VI

A. Proof of Lemma 7

Proof of 7-1: Let $F_R$ denote the cumulative distribution function (cdf) of $R$. For $b \in [0, \bar R]$, the cdf of $\max\{b,R\}$ is

$$
F_{\max\{b,R\}}(r) = \begin{cases} 0 & \text{if } r < b,\\ F_R(r) & \text{if } r \ge b,\end{cases}
$$

using which $\beta(b)$ can be written as

$$
\beta(b) = \int_0^{\bar R}\big(1 - F_{\max\{b,R\}}(r)\big)\,dr - \frac{T}{\eta N} = b + \int_b^{\bar R}\big(1 - F_R(r)\big)\,dr - \frac{T}{\eta N}.
$$

Since $\beta'(b) = F_R(b) \ge 0$ and $\beta''(b) = f_R(b) \ge 0$, $\beta$ is continuous, increasing, and convex in $b$.
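For a concrete case (an illustration, not one of the paper's settings), take $R \sim$ Uniform$[0,1]$, so that $\bar R = 1$ and $F_R(r) = r$. The formula above then gives $\beta(b) = (1+b^2)/2 - T/(\eta N)$. The sketch evaluates the integral form numerically, compares it with this closed form, and checks $\beta'(b) = F_R(b)$ with a central difference. The value `C = 0.08`, which stands for $T/(\eta N)$, and the function names are assumptions.

```python
import numpy as np

R_BAR = 1.0  # upper end of the reward support; rewards assumed Uniform[0, R_BAR]
C = 0.08     # stands for T / (eta * N); illustrative value only

def F_R(r):
    """cdf of Uniform[0, R_BAR]."""
    return np.clip(r / R_BAR, 0.0, 1.0)

def trapezoid(y, x):
    """Trapezoid rule, written out so it does not depend on a particular NumPy version."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def beta(b, grid=2001):
    """beta(b) = b + int_b^{R_BAR} (1 - F_R(r)) dr - T/(eta N)."""
    r = np.linspace(b, R_BAR, grid)
    return b + trapezoid(1.0 - F_R(r), r) - C

def beta_slope(b, h=1e-4):
    """Central-difference estimate of beta'(b), which the proof says equals F_R(b)."""
    return (beta(b + h) - beta(b - h)) / (2.0 * h)

for b in (0.0, 0.3, 0.6, 0.9):
    print(b, beta(b), (1 + b * b) / 2 - C)   # integral form vs. closed form
```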

Proof of 7-2: From the expression for $\beta(b)$ above, note that $\beta(\bar R) = \bar R - \frac{T}{\eta N} < \bar R$. Also, $\beta$ is convex (from Lemma 7-1). Hence we can write,

m < ^m + =m 

K it 

< b. 



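Proof of 7-3: Let $g(b) = b - \beta(b)$. Then $g(0) \le 0$ and $g(\bar R) > 0$ (because $\beta(\bar R) < \bar R$). Also, $g$ is continuous (being differentiable) on $[0, \bar R]$. Hence there exists an $\alpha \in [0, \bar R)$ such that $g(\alpha) = 0$; take $\alpha$ to be the smallest such point.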

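Suppose there exists an $\alpha' > \alpha$ such that $g(\alpha') = 0$; note that $\alpha' < \bar R$ since $g(\bar R) > 0$. Then, by convexity of $\beta$ (from Lemma 7-1),

$$
\alpha' = \beta(\alpha') \le \frac{\bar R - \alpha'}{\bar R - \alpha}\,\beta(\alpha) + \frac{\alpha' - \alpha}{\bar R - \alpha}\,\beta(\bar R) = \frac{\bar R - \alpha'}{\bar R - \alpha}\,\alpha + \frac{\alpha' - \alpha}{\bar R - \alpha}\,\beta(\bar R),
$$

which implies that $\beta(\bar R) \ge \bar R$, contradicting the fact that $\beta(\bar R) < \bar R$.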



Proof of 7-4: Again consider $g(b) = b - \beta(b)$, which is continuous (being differentiable) on $[0, \bar R]$. Suppose there exists $b \in (\alpha, \bar R]$ such that $\beta(b) \ge b$; then $g(b) \le 0$ and $g(\bar R) > 0$ (so $b < \bar R$). This implies that there exists $b' \in [b, \bar R)$ such that $g(b') = 0$, contradicting the uniqueness of $\alpha$ shown in Lemma 7-3. Similarly, it can be shown that $\beta(b) > b$ for $b \in [0, \alpha)$. ■
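Continuing the illustrative Uniform$[0,1]$ case, where $\beta(b) = (1+b^2)/2 - C$ with $C$ standing for $T/(\eta N)$, we get $g(b) = b - \beta(b) = C - (1-b)^2/2$. So for $0 < C \le 1/2$ the unique fixed point is $\alpha = 1 - \sqrt{2C}$. The sketch finds $\alpha$ by bisection on $g$, which mirrors the intermediate-value argument in the proof of 7-3, and then checks the sign pattern of 7-4. `C = 0.08` (which gives $\alpha = 0.6$) and the helper names are assumptions.

```python
import math

C = 0.08  # stands for T / (eta * N); illustrative value only

def beta(b):
    """Closed form of beta for Uniform[0, 1] rewards: E[max{b, R}] - C = (1 + b^2)/2 - C."""
    return (1.0 + b * b) / 2.0 - C

def fixed_point(lo=0.0, hi=1.0, tol=1e-12):
    """Bisection on g(b) = b - beta(b), keeping g(lo) <= 0 < g(hi) as in the proof of 7-3."""
    g = lambda b: b - beta(b)
    assert g(lo) <= 0 < g(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha = fixed_point()
print(alpha, 1 - math.sqrt(2 * C))  # numerical root vs. closed form
```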

B. Proof of Lemma 8

The proof is by induction. From (24), we see that the result already holds for $k = N-1$. Next we prove it for $k = N-2$. Let us evaluate the value function at stage $N-2$ and simplify using the expression for $J_{N-1}$ from (24):



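$$
\begin{aligned}
J_{N-2}(b) &= \min\Big\{-\eta b,\ \mathbb{E}\big[U_{N-1} + J_{N-1}\big(\max\{b,R_{N-1}\}\big)\big]\Big\}\\
&= \min\Big\{-\eta b,\ \mathbb{E}\big[U_{N-1} + \min\big\{-\eta\max\{b,R_{N-1}\},\ -\eta\,\beta_1\big(\max\{b,R_{N-1}\}\big)\big\}\big]\Big\}\\
&= \min\Big\{-\eta b,\ \frac{T}{N} - \eta\,\mathbb{E}\big[\max\{b,R,\beta_1(\max\{b,R\})\}\big]\Big\}\\
&= \min\{-\eta b,\ -\eta\,\beta_2(b)\},
\end{aligned}\tag{40}
$$

where

$$
\beta_2(b) = \mathbb{E}\big[\max\{b,R,\beta_1(\max\{b,R\})\}\big] - \frac{T}{\eta N}. \tag{41}
$$

That $\beta_2(b) \ge \beta_1(b)$ easily follows because $\mathbb{E}\big[\max\{b,R,\beta_1(\max\{b,R\})\}\big] \ge \mathbb{E}\big[\max\{b,R\}\big]$. Next, if $b \ge \alpha$, then from Lemma 7-2 and 7-4 we have $\max\{b,R\} \ge \beta_1(\max\{b,R\})$, so that $\max\{b,R,\beta_1(\max\{b,R\})\} = \max\{b,R\}$. Therefore,

$$
\beta_2(b) = \mathbb{E}\big[\max\{b,R\}\big] - \frac{T}{\eta N} = \beta_1(b). \tag{42}
$$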
Hence the lemma holds for $N-2$. Suppose that the lemma (i.e., (25), (26), and the ordering property) holds for some $k$, $1 < k < N-1$. Then, following the same arguments used to obtain (40) and (41) (replacing $N-2$ by $k-1$ and $N-1$ by $k$), we can show that (25) and (26) hold for stage $k-1$ as well. The ordering property can easily be shown to hold for stage $k-1$ using the ordering property for stage $k$. ■
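In the same illustrative Uniform$[0,1]$ setting, with `C` standing for $T/(\eta N)$ (an assumed value), (41) can be evaluated by numerical integration over $r$. The sketch checks the two properties just proved: $\beta_2 \ge \beta_1$ everywhere, and $\beta_2 = \beta_1$ for $b \ge \alpha$, where $\alpha = 1 - \sqrt{2C}$ is the fixed point of $\beta_1$. The function names are hypothetical.

```python
import numpy as np

C = 0.08                    # stands for T / (eta * N); illustrative value only
ALPHA = 1 - np.sqrt(2 * C)  # fixed point of beta_1 for Uniform[0, 1] rewards (= 0.6)

def beta1(b):
    """beta_1(b) = E[max{b, R}] - C for R ~ Uniform[0, 1]."""
    return (1.0 + b * b) / 2.0 - C

def beta2(b, grid=200001):
    """beta_2(b) = E[max{b, R, beta_1(max{b, R})}] - C, integrating over r in [0, 1]."""
    r = np.linspace(0.0, 1.0, grid)
    x = np.maximum(b, r)            # max{b, r}
    h = np.maximum(x, beta1(x))     # max{b, r, beta_1(max{b, r})}
    return float(np.sum((h[1:] + h[:-1]) * np.diff(r)) / 2.0) - C

for b in (0.0, 0.3, 0.6, 0.8):
    print(b, beta1(b), beta2(b))
```

As a hand check with $C = 0.08$: $\beta_2(0) = \int_0^{0.6}\big(0.42 + r^2/2\big)\,dr + \int_{0.6}^{1} r\,dr - 0.08 = 0.528$, compared with $\beta_1(0) = 0.42$.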

Appendix IV 

One-Hop Performance for Different Reward Distributions ($f_R$) and Initial Beliefs ($p_0$)

In Section VII-A we performed simulations to compare the one-hop performance of the various policies (recall the description of the implemented policies from Section VII-A). There we had considered the context of geographical forwarding (which was the primary motivation for our work), so that the reward associated with a relay is the progress it makes towards the sink (see Fig. 8). Also, the initial belief we had considered was truncated Poisson (of mean $\lambda = 10$) with $K = 50$ (recall that $K$ is the bound on the number of relays). From Tables II and III we were able to draw the following conclusions. For large values of the target $\gamma$,

• The average delays of $\pi_{\rm A\text{-}SIMPL}$ and $\pi_{\rm INNER}$ are close to that of $\pi_{\rm COMDP}$, which is the optimal policy.

• The difference between the delays incurred by $\pi_{\rm INNER}$ and $\pi_{\rm OUTER}$ is small.

• $\pi_{\rm A\text{-}COMDP}$ incurs a larger delay.

For smaller values of the target $\gamma$, all the policies incur similar average delays.

In this appendix, to comment further on these conclusions, we have performed simulations for a few other examples with different pairs of reward distributions ($f_R$) and initial beliefs ($p_0$). In each of these examples, the policy $\pi_{\rm A\text{-}SIMPL}$ performs well. We fix $T = 1$ and normalize the rewards to take values within the interval $[0, 1]$ for all the examples. The first two examples extend the scenario of geographical forwarding mentioned earlier, while in the next two we simply take $R$ to have uniform and truncated Gaussian distributions, respectively. As in Section VII-A, we discretize the state space and approximate all the pdfs with pmfs in the simulations. For each example we tabulate the results (i.e., the average reward values for a few values of the target $\gamma$ and the corresponding average delays), which have the same interpretation as Tables II and III (see the explanation following those tables).
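The discretization mentioned above can be done in several ways. The sketch below shows one simple choice, which is an assumption and not necessarily the scheme behind the reported results: evaluate the pdf at the bin midpoints, multiply by the bin width, and renormalize. It is applied to the illustrative pdf $f(r) = 2r$ on $[0,1]$, whose mean is $2/3$. `discretize_pdf` is a hypothetical helper.

```python
import numpy as np

def discretize_pdf(pdf, lo, hi, bins):
    """Approximate a pdf on [lo, hi] by a pmf on bin midpoints (midpoint rule, renormalized)."""
    edges = np.linspace(lo, hi, bins + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    mass = pdf(mids) * np.diff(edges)
    return mids, mass / mass.sum()

# Illustrative pdf f(r) = 2r on [0, 1] (mean 2/3); not one of the paper's reward laws.
support, pmf = discretize_pdf(lambda r: 2.0 * r, 0.0, 1.0, 100)
print(pmf.sum(), float(np.dot(support, pmf)))
```

With 100 bins, the discretized mean differs from $2/3$ by $1/(6\cdot 100^2)$.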

EXAMPLE 1 

• Reward: We consider the same scenario of geographical forwarding as in Section VII-A, but here we allow the reward to be a function of the progress. Let $Z_i$ be the progress made by relay $i$. Small values of $Z_i$ are not favourable because the packet does not make significant progress towards the sink. On the other hand, when $Z_i$ is large, the attenuation of the signal transmitted from the source to the relay is large, so a higher power is required to achieve a given packet error rate. Thus, we want to penalize both small and large values of $Z_i$. This motivates us to choose the reward function $R_i = -a_1 Z_i \log(Z_i/a_2)$, which is maximum at $Z_i = a_2/e$. We have chosen $a_2 = 0.4e$ (so the maximum is at $Z_i = 0.4$), and $a_1$ is a constant used to normalize the maximum reward value to 1. Using $f_Z$ in (30), one can obtain $f_R$.

• Initial Belief: The bound on the number of relays is $K = 40$. The initial belief is truncated Poisson with parameter 5, i.e., for $n = 1, 2, \dots, K$, $p_0(n) = c\,\frac{5^n e^{-5}}{n!}$, where $c$ is the normalization constant.
Results are tabulated in Tables IV and V.
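As a small check of the Example 1 choices (an illustrative sketch, not the code used for the tables): the peak value of $R = -a_1 Z\log(Z/a_2)$ is $a_1 a_2/e$. Normalizing it to 1 with $a_2 = 0.4e$ gives $a_1 = e/a_2 = 2.5$, and the peak is at $Z = a_2/e = 0.4$. The sketch also builds the truncated Poisson initial belief with parameter 5 on $\{1, \dots, 40\}$. The function names are hypothetical, and the pdf $f_Z$ needed to obtain $f_R$ is not reproduced here.

```python
import math
import numpy as np

A2 = 0.4 * math.e   # a_2 = 0.4e (Example 1)
A1 = math.e / A2    # normalizes the peak value a_1 * a_2 / e to 1, i.e. a_1 = 2.5

def reward(z):
    """R = -a_1 * z * log(z / a_2) for progress z > 0."""
    return -A1 * z * np.log(z / A2)

def truncated_poisson(lam, K):
    """p_0(n) = c * lam^n e^{-lam} / n! for n = 1..K, with c the normalization constant."""
    logw = [n * math.log(lam) - lam - math.lgamma(n + 1) for n in range(1, K + 1)]
    w = np.exp(np.array(logw))
    return w / w.sum()

z = np.linspace(1e-6, 1.0, 1_000_001)
r = reward(z)
print(A1, z[np.argmax(r)], r.max())   # peak location near a_2/e = 0.4, peak value near 1

p0 = truncated_poisson(5.0, 40)
n = np.arange(1, 41)
print(p0.sum(), float(np.dot(n, p0)))
```

Working with logarithms in `truncated_poisson` avoids forming $5^n$ and $n!$ separately.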



Target γ   E[R(π_COMDP)]   E[R(π_INNER)]   E[R(π_OUTER)]   E[R(π_A-COMDP)]   E[R(π_A-SIMPL)]
0.7800     0.7751          0.7755          0.7826          0.7730            0.7755
0.8200     0.8164          0.8195          0.8216          0.8184            0.8166
0.8600     0.8628          0.8663          0.8593          0.8651            0.8589
0.9000     0.9018          0.8991          0.9006          0.8986            0.8992
0.9400     0.9402          0.9397          0.9407          0.9401            0.9406



TABLE IV 

EXAMPLE 1: TARGET γ AND CORRESPONDING AVERAGE REWARDS



Target γ   E[D(π_COMDP)]   E[D(π_INNER)]   E[D(π_OUTER)]   E[D(π_A-COMDP)]   E[D(π_A-SIMPL)]
0.7800     0.1950          0.1963          0.1993          0.1963            0.1963
0.8200     0.2082          0.2150          0.2168          0.2164            0.2133
0.8600     0.2340          0.2471          0.2431          0.2499            0.2411
0.9000     0.2715          0.2839          0.2865          0.2871            0.2840
0.9400     0.3649          0.4005          0.4078          0.4153            0.4056



TABLE V 

EXAMPLE 1: AVERAGE DELAYS CORRESPONDING TO AVERAGE REWARDS IN TABLE IV



EXAMPLE 2 

• Reward: Again we consider the scenario of geographical forwarding. Let $Z_i$ be the progress made by relay $i$ and $H_i$ the (normalized) data rate from the source to relay $i$. $H_i$ is a random variable taking values in the set $\{0.2, 0.4, 0.6, 0.8, 1.0\}$. For small (large) values of $Z_i$ there is a high (low) probability that the data rate is good. Thus, as $Z_i$ increases, we want the probability of $H_i$ taking larger values to decrease. Therefore, when $Z_i = z$ we set $P(H_i = h \mid Z_i = z) = a_z\, h\, e^{-dzh}$ for $h \in \{0.2, \dots, 1.0\}$. As a function of $h$, $h e^{-dzh}$ attains its maximum at $h = 1/(dz)$, so that as $Z_i$ increases, $H_i$ takes lower values with high probability. We have chosen $d = 10$; $a_z$ is a constant that normalizes the total probability to 1. Finally, the reward associated with relay $i$ is $R_i = c_1 Z_i + c_2 H_i$; we choose $c_1 = c_2 = 0.5$.

• Initial Belief: $K = 30$ and $p_0$ is binomial with parameter 0.5, i.e., for $n = 1, 2, \dots, K$, $p_0(n) = c\binom{K}{n}(0.5)^K$, where $c$ is the normalization constant. Such an initial belief is appropriate if, at deployment, the source had $K$ potential relays and, by the time the source has a packet (which happens after a significant amount of time because events are rare), each relay has not failed with probability 0.5 (we have ignored the case where all the relays have failed).

Results are tabulated in Tables VI and VII.
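The Example 2 conditional rate distribution and the binomial initial belief can be sketched as follows (illustrative code, not the authors'). The sketch checks that the most likely rate $h$ moves to lower values as the progress $z$ grows, as the text intends. The set of $z$ values and the function names are assumptions.

```python
import math
import numpy as np

H_VALUES = np.array([0.2, 0.4, 0.6, 0.8, 1.0])  # normalized data rates
D = 10.0                                         # d = 10 (Example 2)

def rate_pmf(z):
    """P(H = h | Z = z) = a_z * h * exp(-d*z*h) on H_VALUES; dividing by the sum plays the role of a_z."""
    w = H_VALUES * np.exp(-D * z * H_VALUES)
    return w / w.sum()

def binomial_belief(K, q=0.5):
    """p_0(n) proportional to C(K, n) q^n (1-q)^(K-n) for n = 1..K (n = 0 excluded)."""
    w = np.array([math.comb(K, n) * q**n * (1 - q) ** (K - n) for n in range(1, K + 1)])
    return w / w.sum()

zs = (0.05, 0.1, 0.2, 0.3, 0.5, 1.0)
modes = [float(H_VALUES[np.argmax(rate_pmf(z))]) for z in zs]
print(modes)  # most likely rate for each progress value z

p0 = binomial_belief(30)
```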
Target γ   E[R(π_COMDP)]   E[R(π_INNER)]   E[R(π_OUTER)]   E[R(π_A-COMDP)]   E[R(π_A-SIMPL)]
0.4500     0.4510          0.4510          0.4510          0.4510            0.4510
0.5000     0.5058          0.5050          0.5056          0.5066            0.5088
0.5500     0.5489          0.5508          0.5506          0.5500            0.5518
0.6000     0.6002          0.6013          0.5989          0.6002            0.6013
0.6500     0.6500          0.6503          0.6500          0.6500            0.6499



TABLE VI 

EXAMPLE 2: TARGET γ AND CORRESPONDING AVERAGE REWARDS



Target γ   E[D(π_COMDP)]   E[D(π_INNER)]   E[D(π_OUTER)]   E[D(π_A-COMDP)]   E[D(π_A-SIMPL)]
0.4500     0.0638          0.0638          0.0638          0.0638            0.0638
0.5000     0.0830          0.0835          0.0836          0.0843            0.0858
0.5500     0.1179          0.1218          0.1225          0.1213            0.1225
0.6000     0.2075          0.2155          0.2159          0.2146            0.2159
0.6500     0.4246          0.4456          0.4582          0.4510            0.4443



TABLE VII 

EXAMPLE 2: AVERAGE DELAYS CORRESPONDING TO AVERAGE REWARDS IN TABLE VI

EXAMPLE 3 

• Reward: $R$ is distributed uniformly on $[0, 1]$.

• Initial Belief: $K = 20$ and $p_0$ is binomial with parameter 0.5.
Results are tabulated in Tables VIII and IX.



Target γ   E[R(π_COMDP)]   E[R(π_INNER)]   E[R(π_OUTER)]   E[R(π_A-COMDP)]   E[R(π_A-SIMPL)]
0.7000     0.7093          0.7102          0.7099          0.7135            0.7119
0.7500     0.7566          0.7588          0.7523          0.7580            0.7538
0.8000     0.8030          0.7984          0.8004          0.8040            0.7968
0.8500     0.8503          0.8512          0.8500          0.8488            0.8485
0.9000     0.9000          0.9001          0.8999          0.9001            0.9009



TABLE VIII 

EXAMPLE 3: TARGET γ AND CORRESPONDING AVERAGE REWARDS



Target γ   E[D(π_COMDP)]   E[D(π_INNER)]   E[D(π_OUTER)]   E[D(π_A-COMDP)]   E[D(π_A-SIMPL)]
0.7000     0.1557          0.1588          0.1594          0.1610            0.1600
0.7500     0.1846          0.1910          0.1870          0.1909            0.1872
0.8000     0.2279          0.2288          0.2339          0.2367            0.2279
0.8500     0.3033          0.3180          0.3201          0.3136            0.3107
0.9000     0.5115          0.5443          0.5515          0.5995            0.5529



TABLE IX 

EXAMPLE 3: AVERAGE DELAYS CORRESPONDING TO AVERAGE REWARDS IN TABLE VIII
EXAMPLE 4

• Reward: Truncated Gaussian with mean 0.5 and variance 1, i.e., for $r \in [0, 1]$, $f_R(r) = \frac{c}{\sqrt{2\pi}}\, e^{-(r-0.5)^2/2}$, where $c$ is the normalization constant.

• Initial Belief: $K = 15$ and $p_0$ is uniform on $\{1, 2, \dots, K\}$.
Results are tabulated in Tables X and XI.
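For Example 4, the normalization constant has a closed form. The mass of $\mathcal{N}(0.5, 1)$ on $[0,1]$ is $\Phi(0.5) - \Phi(-0.5) = \operatorname{erf}(0.5/\sqrt 2)$, so $c = 1/\operatorname{erf}(0.5/\sqrt 2) \approx 2.61$. The sketch computes $c$ with `math.erf` and confirms numerically that $f_R$ integrates to 1. The helper names are hypothetical.

```python
import math
import numpy as np

MEAN, VAR = 0.5, 1.0  # Example 4: Gaussian with mean 0.5 and variance 1, truncated to [0, 1]

def std_normal_cdf(x):
    """Standard normal cdf written with math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

SD = math.sqrt(VAR)
# c = 1 / P(0 <= X <= 1) for X ~ N(MEAN, VAR)
c = 1.0 / (std_normal_cdf((1.0 - MEAN) / SD) - std_normal_cdf((0.0 - MEAN) / SD))

def f_R(r):
    """Truncated Gaussian pdf on [0, 1]: (c / sqrt(2 pi VAR)) exp(-(r - MEAN)^2 / (2 VAR))."""
    r = np.asarray(r, dtype=float)
    dens = c / math.sqrt(2.0 * math.pi * VAR) * np.exp(-((r - MEAN) ** 2) / (2.0 * VAR))
    return np.where((r >= 0.0) & (r <= 1.0), dens, 0.0)

grid = np.linspace(0.0, 1.0, 100_001)
vals = f_R(grid)
total = float(np.sum((vals[1:] + vals[:-1]) * np.diff(grid)) / 2.0)
print(c, total)  # c is about 2.61; total is 1 up to quadrature error
```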



Target γ   E[R(π_COMDP)]   E[R(π_INNER)]   E[R(π_OUTER)]   E[R(π_A-COMDP)]   E[R(π_A-SIMPL)]
0.6400     0.6500          0.6487          0.6388          0.6259            0.6302
0.6800     0.6725          0.6728          0.6807          0.6791            0.6769
0.7200     0.7208          0.7240          0.7213          0.7225            0.7146
0.7600     0.7625          0.7601          0.7600          0.7618            0.7607
0.8000     0.8001          0.7997          0.7998          0.7997            0.8009



TABLE X 

EXAMPLE 4: TARGET γ AND CORRESPONDING AVERAGE REWARDS



Target γ   E[D(π_COMDP)]   E[D(π_INNER)]   E[D(π_OUTER)]   E[D(π_A-COMDP)]   E[D(π_A-SIMPL)]
0.6400     0.2092          0.2274          0.2307          0.2290            0.2218
0.6800     0.2222          0.2473          0.2622          0.2689            0.2577
0.7200     0.2600          0.2981          0.3031          0.3122            0.2924
0.7600     0.3060          0.3478          0.3576          0.3740            0.3532
0.8000     0.3799          0.4386          0.4460          0.4735            0.4473



TABLE XI 

EXAMPLE 4: AVERAGE DELAYS CORRESPONDING TO AVERAGE REWARDS IN TABLE X
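To put the delay gaps in Tables V, VII, IX and XI on a common scale, the sketch below computes, at the largest target $\gamma$ of each example, the percentage by which each policy's average delay exceeds that of $\pi_{\rm COMDP}$. The values are copied from those tables, assuming the column order $\pi_{\rm COMDP}$, $\pi_{\rm INNER}$, $\pi_{\rm OUTER}$, $\pi_{\rm A\text{-}COMDP}$, $\pi_{\rm A\text{-}SIMPL}$ that is labelled in Table VIII. Because the achieved average rewards differ slightly across policies, these percentages are only indicative. The names are hypothetical.

```python
# Average delays at the largest target gamma in each example (Tables V, VII, IX, XI),
# columns ordered COMDP, INNER, OUTER, A-COMDP, A-SIMPL.
DELAYS = {
    "Example 1 (gamma = 0.94)": [0.3649, 0.4005, 0.4078, 0.4153, 0.4056],
    "Example 2 (gamma = 0.65)": [0.4246, 0.4456, 0.4582, 0.4510, 0.4443],
    "Example 3 (gamma = 0.90)": [0.5115, 0.5443, 0.5515, 0.5995, 0.5529],
    "Example 4 (gamma = 0.80)": [0.3799, 0.4386, 0.4460, 0.4735, 0.4473],
}
POLICIES = ["COMDP", "INNER", "OUTER", "A-COMDP", "A-SIMPL"]

def excess_over_optimal(row):
    """Percentage increase of each policy's delay over the pi_COMDP delay (first entry)."""
    base = row[0]
    return {name: 100.0 * (d - base) / base for name, d in zip(POLICIES, row)}

for label, row in DELAYS.items():
    ex = excess_over_optimal(row)
    print(label, {k: round(v, 1) for k, v in ex.items()})
```

With these numbers, $\pi_{\rm A\text{-}SIMPL}$ has a smaller excess delay than $\pi_{\rm A\text{-}COMDP}$ in all four examples. In Example 1, for instance, the excess is about 11.2% for $\pi_{\rm A\text{-}SIMPL}$ and 13.8% for $\pi_{\rm A\text{-}COMDP}$.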



